Twitter has roughly 330 million monthly active users, which lets businesses reach a broad audience and connect with customers without intermediaries. On the other hand, so much information flows through the platform that it is difficult for brands to quickly detect negative social mentions that could harm their business.
Sentiment analysis (classification), which involves monitoring the emotions expressed in conversations on social media platforms, has become a key strategy in social media marketing.
Listening to how customers feel about a product or service on Twitter allows companies to understand their audience, keep on top of what is being said about their brand and their competitors, and discover new trends in the industry.
To apply the techniques learned as part of the course, with the following learning outcomes:
* Basic understanding of text pre-processing
* What to do after text pre-processing: bag of words and TF-IDF
* Build the classification model
* Evaluate the model
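The bag-of-words and TF-IDF steps named above can be sketched in plain Python on a toy corpus (the three example documents below are made up for illustration; in practice scikit-learn's CountVectorizer and TfidfVectorizer are the usual choice):

```python
import math
from collections import Counter

# Toy corpus (made-up examples, not rows from the dataset).
docs = [
    "late flight bad service",
    "great flight great crew",
    "late flight again",
]

# Bag of words: one count vector per document over a shared vocabulary.
vocab = sorted({w for d in docs for w in d.split()})
bow = [[Counter(d.split())[w] for w in vocab] for d in docs]

def tfidf(doc_tokens, all_docs):
    """Term frequency scaled by (smoothing-free) inverse document frequency."""
    n = len(all_docs)
    counts = Counter(doc_tokens)
    weights = {}
    for w, c in counts.items():
        df = sum(1 for d in all_docs if w in d.split())  # document frequency
        weights[w] = (c / len(doc_tokens)) * math.log(n / df)
    return weights

weights = tfidf(docs[0].split(), docs)
# "flight" occurs in every document, so its idf (and hence tf-idf) is 0.
```

Words that appear everywhere carry no discriminating signal under TF-IDF, which is exactly why it often outperforms raw counts as model input.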
This is a sentiment analysis task on the problems of each major U.S. airline. The Twitter data was scraped in February 2015, and contributors were asked first to classify tweets as positive, negative, or neutral, and then to categorize the negative reasons (such as "late flight" or "rude service").
The dataset has the following columns:
* tweet_id
* airline_sentiment
* airline_sentiment_confidence
* negativereason
* negativereason_confidence
* airline
* airline_sentiment_gold
* name
* negativereason_gold
* retweet_count
* text
* tweet_coord
* tweet_created
* tweet_location
* user_timezone
There are 14,640 rows and 15 columns.
# Install and import the necessary libraries.
!pip install contractions
import re, string, unicodedata # Regex, string, and Unicode utilities.
import contractions # Expands contractions such as "don't" -> "do not".
from bs4 import BeautifulSoup # HTML parsing and stripping.
import numpy as np
import pandas as pd
import nltk # Natural Language Toolkit.
# Download the tokenizer models and WordNet data used below.
nltk.download('punkt')
nltk.download('wordnet')
from nltk.corpus import stopwords # Stopword lists.
from nltk.tokenize import word_tokenize, sent_tokenize # Tokenizers.
from nltk.stem.wordnet import WordNetLemmatizer # Lemmatizer.
import matplotlib.pyplot as plt
import seaborn as sns
# Ignore warnings.
import warnings
warnings.filterwarnings("ignore")
# Loading the data from a .csv file into a pandas dataframe.
df_Tweets_Orig = pd.read_csv("Tweets.csv")
df_Tweets_Orig.head()
| tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 570306133677760513 | neutral | 1.0000 | NaN | NaN | Virgin America | NaN | cairdin | NaN | 0 | @VirginAmerica What @dhepburn said. | NaN | 2015-02-24 11:35:52 -0800 | NaN | Eastern Time (US & Canada) |
| 1 | 570301130888122368 | positive | 0.3486 | NaN | 0.0000 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica plus you've added commercials t... | NaN | 2015-02-24 11:15:59 -0800 | NaN | Pacific Time (US & Canada) |
| 2 | 570301083672813571 | neutral | 0.6837 | NaN | NaN | Virgin America | NaN | yvonnalynn | NaN | 0 | @VirginAmerica I didn't today... Must mean I n... | NaN | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 3 | 570301031407624196 | negative | 1.0000 | Bad Flight | 0.7033 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica it's really aggressive to blast... | NaN | 2015-02-24 11:15:36 -0800 | NaN | Pacific Time (US & Canada) |
| 4 | 570300817074462722 | negative | 1.0000 | Can't Tell | 1.0000 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica and it's a really big bad thing... | NaN | 2015-02-24 11:14:45 -0800 | NaN | Pacific Time (US & Canada) |
# Make a copy of the original dataframe to work on.
df_Tweets = df_Tweets_Orig.copy()
# View a random sampling of the rows of records.
df_Tweets.sample(n=75, random_state=1)
| tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8515 | 568198336651649027 | positive | 1.0000 | NaN | NaN | Delta | NaN | GenuineJack | NaN | 0 | @JetBlue I'll pass along the advice. You guys ... | NaN | 2015-02-18 16:00:14 -0800 | Massachusetts | Central Time (US & Canada) |
| 3439 | 568438094652956673 | negative | 0.7036 | Lost Luggage | 0.7036 | United | NaN | vina_love | NaN | 0 | @united I sent you a dm with my file reference... | NaN | 2015-02-19 07:52:57 -0800 | ny | Quito |
| 6439 | 567858373527470080 | positive | 1.0000 | NaN | NaN | Southwest | NaN | Capt_Smirk | NaN | 0 | @SouthwestAir Black History Commercial is real... | NaN | 2015-02-17 17:29:21 -0800 | La Florida | Eastern Time (US & Canada) |
| 5112 | 569336871853170688 | negative | 1.0000 | Late Flight | 1.0000 | Southwest | NaN | scoobydoo9749 | NaN | 0 | @SouthwestAir why am I still in Baltimore?! @d... | [39.1848041, -76.6787131] | 2015-02-21 19:24:22 -0800 | Tallahassee, FL | America/Chicago |
| 5645 | 568839199773732864 | positive | 0.6832 | NaN | NaN | Southwest | NaN | laurafall | NaN | 0 | @SouthwestAir SEA to DEN. South Sound Volleyba... | NaN | 2015-02-20 10:26:48 -0800 | NaN | Pacific Time (US & Canada) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10116 | 569532394455359488 | negative | 1.0000 | Customer Service Issue | 1.0000 | US Airways | NaN | storylaura | NaN | 0 | @USAirways this is the worst customer service ... | NaN | 2015-02-22 08:21:18 -0800 | kansas city, mo | Eastern Time (US & Canada) |
| 3357 | 568502194854612992 | negative | 0.6528 | Bad Flight | 0.3280 | United | NaN | rick4tkins | NaN | 0 | @united tried that already & tried forgett... | NaN | 2015-02-19 12:07:40 -0800 | NaN | NaN |
| 5938 | 568470266793340929 | negative | 1.0000 | Customer Service Issue | 0.6665 | Southwest | NaN | CeceliaNBrady | NaN | 1 | @SouthwestAir 3 flights yesterday; no treats ... | NaN | 2015-02-19 10:00:47 -0800 | NaN | Eastern Time (US & Canada) |
| 4316 | 567634106058821632 | neutral | 1.0000 | NaN | NaN | United | NaN | gwaki | NaN | 0 | @united even though technically after I land I... | NaN | 2015-02-17 02:38:11 -0800 | NaN | Central Time (US & Canada) |
| 9439 | 569938119060955136 | negative | 1.0000 | Late Flight | 0.6617 | US Airways | NaN | DanKolbet | NaN | 0 | @USAirways seriously buy some WD40 for A319 op... | NaN | 2015-02-23 11:13:31 -0800 | Spokane, Washington | Pacific Time (US & Canada) |
75 rows × 15 columns
df_Tweets.shape # Print the shape of the dataset.
(14640, 15)
Observation:
The dataset has 15 columns and 14640 rows of data.
df_Tweets.info() # Information of all columns in the dataframe.
<class 'pandas.core.frame.DataFrame'> RangeIndex: 14640 entries, 0 to 14639 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tweet_id 14640 non-null int64 1 airline_sentiment 14640 non-null object 2 airline_sentiment_confidence 14640 non-null float64 3 negativereason 9178 non-null object 4 negativereason_confidence 10522 non-null float64 5 airline 14640 non-null object 6 airline_sentiment_gold 40 non-null object 7 name 14640 non-null object 8 negativereason_gold 32 non-null object 9 retweet_count 14640 non-null int64 10 text 14640 non-null object 11 tweet_coord 1019 non-null object 12 tweet_created 14640 non-null object 13 tweet_location 9907 non-null object 14 user_timezone 9820 non-null object dtypes: float64(2), int64(2), object(11) memory usage: 1.7+ MB
Observation:
Several columns contain large numbers of null values: negativereason, negativereason_confidence, airline_sentiment_gold, negativereason_gold, tweet_coord, tweet_location, and user_timezone.
# View some basic statistical details like percentile, mean, std etc. of a data frame of numeric values.
df_Tweets.describe()
| tweet_id | airline_sentiment_confidence | negativereason_confidence | retweet_count | |
|---|---|---|---|---|
| count | 1.464000e+04 | 14640.000000 | 10522.000000 | 14640.000000 |
| mean | 5.692184e+17 | 0.900169 | 0.638298 | 0.082650 |
| std | 7.791112e+14 | 0.162830 | 0.330440 | 0.745778 |
| min | 5.675883e+17 | 0.335000 | 0.000000 | 0.000000 |
| 25% | 5.685592e+17 | 0.692300 | 0.360600 | 0.000000 |
| 50% | 5.694779e+17 | 1.000000 | 0.670600 | 0.000000 |
| 75% | 5.698905e+17 | 1.000000 | 1.000000 | 0.000000 |
| max | 5.703106e+17 | 1.000000 | 1.000000 | 44.000000 |
df_Tweets.isnull().sum(axis=0) # Check for NULL values.
tweet_id 0 airline_sentiment 0 airline_sentiment_confidence 0 negativereason 5462 negativereason_confidence 4118 airline 0 airline_sentiment_gold 14600 name 0 negativereason_gold 14608 retweet_count 0 text 0 tweet_coord 13621 tweet_created 0 tweet_location 4733 user_timezone 4820 dtype: int64
Observation:
The null counts confirm the info() output above: airline_sentiment_gold (14,600 nulls), negativereason_gold (14,608), and tweet_coord (13,621) are almost entirely empty, while negativereason, negativereason_confidence, tweet_location, and user_timezone also have thousands of missing values.
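Given how empty some columns are, a common follow-up is to drop columns whose missing fraction exceeds a threshold. A minimal sketch on a toy frame (the column names echo the dataset, but the values and the 50% threshold are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the heavy missingness reported above (values are made up).
df = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "airline_sentiment_gold": [np.nan, np.nan, np.nan, "positive"],
    "tweet_coord": [np.nan, np.nan, "[0, 0]", "[1, 1]"],
})

# Keep only columns with at most 50% missing values (threshold is illustrative).
keep = df.columns[df.isnull().mean() <= 0.5]
df_reduced = df[keep]
```

Since the modeling here only needs text and airline_sentiment (both fully populated), the notebook later simply selects those two columns instead.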
plt.figure(figsize=(12,7))
sns.heatmap(df_Tweets.isnull(), cmap = "Blues") # Visualization of missing value using heatmap.
plt.title("Missing values?", fontsize = 15)
plt.show()
Observations:
The heatmap shows that airline_sentiment_gold, negativereason_gold, and tweet_coord are almost entirely missing, while negativereason, negativereason_confidence, tweet_location, and user_timezone have substantial gaps.
# Check the missing values for all the columns.
def return_missing_values(data_frame):
missing_values = data_frame.isnull().sum()
missing_values = missing_values[missing_values>0]
missing_values.sort_values(inplace=True)
return missing_values
# Plot the count of missing values in every column.
def plot_missing_values(data_frame):
missing_values = return_missing_values(data_frame)
missing_values = missing_values.to_frame()
missing_values.columns = ['count']
missing_values.index.names = ['Name']
missing_values['Name'] = missing_values.index
sns.set(style='darkgrid')
sns.barplot(x='Name', y='count', data=missing_values)
plt.title('Bar plot for Null Values in each column')
plt.xticks(rotation=90)
plt.show()
# Get the count of missing values in every column of the dataframe.
return_missing_values(df_Tweets)
negativereason_confidence 4118 tweet_location 4733 user_timezone 4820 negativereason 5462 tweet_coord 13621 airline_sentiment_gold 14600 negativereason_gold 14608 dtype: int64
# Plotting the count of missing values.
plot_missing_values(df_Tweets)
Observations:
The additional table and graph further illustrate the earlier observations regarding missing values.
# Get the unique values of every column.
def return_unique_values(data_frame):
unique_dataframe = pd.DataFrame()
unique_dataframe['Features'] = data_frame.columns
uniques = []
for col in data_frame.columns:
u = data_frame[col].nunique()
uniques.append(u)
unique_dataframe['Uniques'] = uniques
return unique_dataframe
# How many unique values are there in each attribute/column?
unidf = return_unique_values(df_Tweets)
print(unidf)
Features Uniques 0 tweet_id 14485 1 airline_sentiment 3 2 airline_sentiment_confidence 1023 3 negativereason 10 4 negativereason_confidence 1410 5 airline 6 6 airline_sentiment_gold 3 7 name 7701 8 negativereason_gold 13 9 retweet_count 18 10 text 14427 11 tweet_coord 832 12 tweet_created 14247 13 tweet_location 3081 14 user_timezone 85
# Plot the count of unique values in every column.
f, ax = plt.subplots(1,1, figsize=(16,5))
sns.barplot(x=unidf['Features'], y=unidf['Uniques'], alpha=0.7)
plt.title('Bar plot for Unique Values in each column')
plt.ylabel('Unique values', fontsize=14)
plt.xlabel('Features', fontsize=14)
plt.xticks(rotation=90)
plt.show()
Observations:
A visualization of the number of unique values in each column. tweet_id has the most because each tweet is uniquely identified, so it is unlikely to add much value in the model-building exercise.
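With 14,485 unique tweet_id values across 14,640 rows, some tweets appear more than once. One option, sketched here on a made-up frame, is to de-duplicate on the id before modeling:

```python
import pandas as pd

# Made-up frame with one repeated tweet_id.
df = pd.DataFrame({
    "tweet_id": [1, 2, 2, 3],
    "text": ["first", "dup", "dup", "last"],
})

n_dupes = df["tweet_id"].duplicated().sum()  # rows beyond each first occurrence
df_unique = df.drop_duplicates(subset="tweet_id", keep="first")
```

Whether to drop duplicates depends on whether repeated tweets should carry extra weight in training; the notebook keeps all rows.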
# Plot for Twitter US Airline Sentiment Labels using matplotlib.
colors = ['#ff6666', '#ffcc99', '#99ff99']
sns.set(rc={'figure.figsize':(11.7,8.27)})
plot = plt.pie(df_Tweets['airline_sentiment'].value_counts(), labels=df_Tweets['airline_sentiment'].value_counts().index, colors=colors, startangle=90, autopct='%.2f')
centre_circle = plt.Circle((0,0),0.5,color='black', fc='white',linewidth=0)
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.title('Pie plot for Twitter US Airline Sentiment Labels')
plt.axis('equal')
plt.tight_layout()
plt.show()
Observations:
Negative tweets dominate at roughly 63% of the dataset, followed by neutral at about 21% and positive at about 16%.
# Count number of each type of tweet.
df_Tweets['airline_sentiment'].value_counts()
negative 9178 neutral 3099 positive 2363 Name: airline_sentiment, dtype: int64
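Turning the counts above into shares makes the class imbalance explicit; this quick check uses only the printed counts:

```python
# Class counts as printed by value_counts() above.
counts = {"negative": 9178, "neutral": 3099, "positive": 2363}
total = sum(counts.values())  # 14640 tweets
shares = {label: round(c / total, 3) for label, c in counts.items()}
# Roughly 63% of tweets are negative, so plain accuracy would be a weak
# metric for the eventual classifier; per-class metrics are safer.
```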
# A function to create labeled barplots.
def labeled_barplot(df_Tweets, feature, title, pallet,perc=True, n=None):
"""
Barplot with percentage at the top
df_Tweets: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(df_Tweets[feature]) # length of the column
count = df_Tweets[feature].nunique()
if n is None:
plt.figure(figsize=(16, 4))
else:
plt.figure(figsize=(16, 4))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
df_Tweets[feature],
palette=pallet,
order=df_Tweets[feature].value_counts().index[:20],
)
ax.set_title('Frequency of {} tweeting about US Airlines'.format(title))
for p in ax.patches:
if perc == True:
label = "{:1.2f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
        x = p.get_x() + p.get_width() / 2 # x coordinate: center of the bar.
        y = p.get_height() # y coordinate: top of the bar.
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# Visualize the airlines by number of tweets.
labeled_barplot(df_Tweets, 'airline', 'Airlines','tab20')
Observations:
# Visualize the top 20 users by number of tweets.
labeled_barplot(df_Tweets, 'name', 'Names','tab20')
Observations:
# Visualize the top 20 locations by number of tweets.
labeled_barplot(df_Tweets, 'tweet_location','Locations', 'tab20')
Observations:
# Take the top 50 tweet locations by number of tweets.
dt = df_Tweets['tweet_location'].value_counts().reset_index() # Count of tweets per location.
dt.columns = ['tweet_location', 'count']
dt = dt.sort_values(['count'], ascending=False)[:50] # Top 50 places.
dt.head()
| tweet_location | count | |
|---|---|---|
| 0 | Boston, MA | 157 |
| 1 | New York, NY | 156 |
| 2 | Washington, DC | 150 |
| 3 | New York | 127 |
| 4 | USA | 126 |
# Normalize the format to city, state abbreviation for the top 50 places with the most tweets.
city = []
state = []
for i in dt['tweet_location']:
loc = i.split(',')
if len(loc)>1: # If it has more than one token.
city.append(loc[0])
state.append(loc[1])
else:
city.append('other') # If number of tokens is 1 then we keep it as other.
state.append('other')
dt['city'] = city
dt['state'] = state
dictionary = dict(zip(dt['city'], dt['state'])) # Create a dictionary with key as city and value maps to its state.
dt.head()
| tweet_location | count | city | state | |
|---|---|---|---|---|
| 0 | Boston, MA | 157 | Boston | MA |
| 1 | New York, NY | 156 | New York | NY |
| 2 | Washington, DC | 150 | Washington | DC |
| 3 | New York | 127 | other | other |
| 4 | USA | 126 | other | other |
# Get the final locations.
location = []
for i in dt['tweet_location']:
loc = i.split(',')
if len(loc)==2: # If it has two tokens location will be same.
location.append(loc[0]+','+loc[1])
else:
try:
            state = dictionary[loc[0]] # If only the city is present, map it to its state using the dictionary built above.
            location.append(loc[0]+','+state)
        except:
            location.append('other') # If no mapping is found, label it as other.
dt["location"] = location
dt.head()
| tweet_location | count | city | state | location | |
|---|---|---|---|---|---|
| 0 | Boston, MA | 157 | Boston | MA | Boston, MA |
| 1 | New York, NY | 156 | New York | NY | New York, NY |
| 2 | Washington, DC | 150 | Washington | DC | Washington, DC |
| 3 | New York | 127 | other | other | New York, New York |
| 4 | USA | 126 | other | other | other |
# Get the count of tweets from every place.
ds = dt.groupby(['location']).sum().sort_values(by='count', ascending=False).reset_index()
# Plot the number of tweets from each location.
dt = ds
fig = sns.barplot(
    x=dt["count"],
    y=dt["location"],
    orientation='horizontal'
).set_title('Number of tweets per location (top 50)')
Observations:
# Year of each user's first tweet in the dataset.
df_Tweets['tweet_created'] = pd.to_datetime(df_Tweets['tweet_created']) # Parse the column into datetime.
df_Tweets['year_created'] = df_Tweets['tweet_created'].dt.year # Get the year the tweet was created.
date = df_Tweets.drop_duplicates(subset='name', keep="first") # Keep only each user's first tweet.
date = date[date['year_created']>1970] # Consider only years after 1970.
date = date['year_created'].value_counts().reset_index() # Get the count of users created every year.
date.columns = ['year', 'number']
plt.figure(figsize=(16, 4))
fig = sns.barplot(
x=date["year"],
y=date["number"],
orientation='vertical'
).set_title('Year of first tweet per user')
plt.ylabel('count', fontsize=12)
plt.xlabel('year', fontsize=12)
plt.xticks(rotation=90)
plt.show()
Observation:
As per the description of the dataset, all the tweets were generated in 2015 and specifically in the month of February.
df = df_Tweets.sort_values(['tweet_created'])
df['day'] = df['tweet_created'].astype(str).str.split(' ', expand=True)[0]
ds = df['day'].value_counts().reset_index() # Get the count of no of tweets for every day.
ds.columns = ['day', 'count']
ds = ds.sort_values(['day'])
ds['day'] = ds['day'].astype(str)
plt.figure(figsize=(16, 7))
fig = sns.barplot(
x=ds['count'],
y=ds["day"],
orientation="horizontal",
).set_title('Tweets distribution over days present in dataset')
Observations:
df_Tweets['tweet_created'] = pd.to_datetime(df_Tweets['tweet_created']) # Change the format into datetime readable.
df_Tweets['hour'] = df_Tweets['tweet_created'].dt.hour # Get the hour of every tweet.
ds = df_Tweets['hour'].value_counts().reset_index() # Get the count of no of tweets for every hour.
ds.columns = ['hour', 'count']
ds = ds.sort_values(['hour'])
ds['hour'] = 'Hour ' + ds['hour'].astype(str)
plt.figure(figsize=(16, 7))
fig = sns.barplot(
x=ds["hour"],
y=ds["count"],
orientation='vertical',
).set_title('Tweets distribution over hours')
plt.xticks(rotation='vertical')
Observations:
# Get the no of words in every text.
df_Tweets['word_count'] = [len(t.split()) for t in df_Tweets.text]
df_Tweets.head()
| tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone | year_created | hour | word_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 570306133677760513 | neutral | 1.0000 | NaN | NaN | Virgin America | NaN | cairdin | NaN | 0 | @VirginAmerica What @dhepburn said. | NaN | 2015-02-24 11:35:52-08:00 | NaN | Eastern Time (US & Canada) | 2015 | 11 | 4 |
| 1 | 570301130888122368 | positive | 0.3486 | NaN | 0.0000 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica plus you've added commercials t... | NaN | 2015-02-24 11:15:59-08:00 | NaN | Pacific Time (US & Canada) | 2015 | 11 | 9 |
| 2 | 570301083672813571 | neutral | 0.6837 | NaN | NaN | Virgin America | NaN | yvonnalynn | NaN | 0 | @VirginAmerica I didn't today... Must mean I n... | NaN | 2015-02-24 11:15:48-08:00 | Lets Play | Central Time (US & Canada) | 2015 | 11 | 12 |
| 3 | 570301031407624196 | negative | 1.0000 | Bad Flight | 0.7033 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica it's really aggressive to blast... | NaN | 2015-02-24 11:15:36-08:00 | NaN | Pacific Time (US & Canada) | 2015 | 11 | 17 |
| 4 | 570300817074462722 | negative | 1.0000 | Can't Tell | 1.0000 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica and it's a really big bad thing... | NaN | 2015-02-24 11:14:45-08:00 | NaN | Pacific Time (US & Canada) | 2015 | 11 | 10 |
df_Tweets.head(100)
| tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone | year_created | hour | word_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 570306133677760513 | neutral | 1.0000 | NaN | NaN | Virgin America | NaN | cairdin | NaN | 0 | @VirginAmerica What @dhepburn said. | NaN | 2015-02-24 11:35:52-08:00 | NaN | Eastern Time (US & Canada) | 2015 | 11 | 4 |
| 1 | 570301130888122368 | positive | 0.3486 | NaN | 0.0000 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica plus you've added commercials t... | NaN | 2015-02-24 11:15:59-08:00 | NaN | Pacific Time (US & Canada) | 2015 | 11 | 9 |
| 2 | 570301083672813571 | neutral | 0.6837 | NaN | NaN | Virgin America | NaN | yvonnalynn | NaN | 0 | @VirginAmerica I didn't today... Must mean I n... | NaN | 2015-02-24 11:15:48-08:00 | Lets Play | Central Time (US & Canada) | 2015 | 11 | 12 |
| 3 | 570301031407624196 | negative | 1.0000 | Bad Flight | 0.7033 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica it's really aggressive to blast... | NaN | 2015-02-24 11:15:36-08:00 | NaN | Pacific Time (US & Canada) | 2015 | 11 | 17 |
| 4 | 570300817074462722 | negative | 1.0000 | Can't Tell | 1.0000 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica and it's a really big bad thing... | NaN | 2015-02-24 11:14:45-08:00 | NaN | Pacific Time (US & Canada) | 2015 | 11 | 10 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 569910981868060673 | negative | 1.0000 | Customer Service Issue | 0.6863 | Virgin America | NaN | MerchEngines | NaN | 0 | @VirginAmerica Is it me, or is your website do... | NaN | 2015-02-23 09:25:41-08:00 | Los Angeles, CA | Arizona | 2015 | 9 | 22 |
| 96 | 569909224521641984 | negative | 1.0000 | Customer Service Issue | 0.6771 | Virgin America | NaN | ColorCartel | NaN | 0 | @VirginAmerica I can't check in or add a bag. ... | NaN | 2015-02-23 09:18:42-08:00 | Austin, TX | Mountain Time (US & Canada) | 2015 | 9 | 20 |
| 97 | 569907336485019648 | negative | 1.0000 | Can't Tell | 0.6590 | Virgin America | NaN | MustBeSpoken | NaN | 0 | @VirginAmerica - Let 2 scanned in passengers l... | NaN | 2015-02-23 09:11:12-08:00 | NaN | NaN | 2015 | 9 | 22 |
| 98 | 569896805611089920 | negative | 1.0000 | Flight Booking Problems | 0.6714 | Virgin America | NaN | mattbunk | NaN | 0 | @virginamerica What is your phone number. I ca... | NaN | 2015-02-23 08:29:21-08:00 | Sterling Heights, MI | Eastern Time (US & Canada) | 2015 | 8 | 16 |
| 99 | 569894449620369408 | negative | 1.0000 | Customer Service Issue | 1.0000 | Virgin America | NaN | louisjenny | NaN | 0 | @VirginAmerica is anyone doing anything there ... | NaN | 2015-02-23 08:19:59-08:00 | Washington DC | Quito | 2015 | 8 | 17 |
100 rows × 18 columns
Data Pre-processing steps:
df_Tweets = df_Tweets[["text","airline_sentiment"]]
pd.set_option('display.max_colwidth', None) # Display full dataframe information (Non-truncated text column.)
df_Tweets.head(100)
| text | airline_sentiment | |
|---|---|---|
| 0 | @VirginAmerica What @dhepburn said. | neutral |
| 1 | @VirginAmerica plus you've added commercials to the experience... tacky. | positive |
| 2 | @VirginAmerica I didn't today... Must mean I need to take another trip! | neutral |
| 3 | @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces & they have little recourse | negative |
| 4 | @VirginAmerica and it's a really big bad thing about it | negative |
| ... | ... | ... |
| 95 | @VirginAmerica Is it me, or is your website down? BTW, your new website isn't a great user experience. Time for another redesign. | negative |
| 96 | @VirginAmerica I can't check in or add a bag. Your website isn't working. I've tried both desktop and mobile http://t.co/AvyqdMpi1Y | negative |
| 97 | @VirginAmerica - Let 2 scanned in passengers leave the plane than told someone to remove their bag from 1st class bin? #uncomfortable | negative |
| 98 | @virginamerica What is your phone number. I can't find who to call about a flight reservation. | negative |
| 99 | @VirginAmerica is anyone doing anything there today? Website is useless and no one is answering the phone. | negative |
100 rows × 2 columns
# Remove the html tags.
def strip_html(text):
soup = BeautifulSoup(text, "html.parser")
return soup.get_text()
# Expand the contractions.
def replace_contractions(text):
"""Replace contractions in string of text"""
return contractions.fix(text)
# Remove numbers from the text.
def remove_numbers(text):
text = re.sub(r'\d+', '', text)
return text
# Remove URLs from the text.
def remove_url(text):
text = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+','',text)
return text
# Remove the mentions in the tweets.
def remove_mention(text):
text = re.sub(r'@\w+','',text)
return text
def clean_text(text):
text = strip_html(text)
text = replace_contractions(text)
text = remove_numbers(text)
text = remove_url(text)
text = remove_mention(text)
return text
df_Tweets['text'] = df_Tweets['text'].apply(lambda x: clean_text(x))
df_Tweets.head(100)
| text | airline_sentiment | |
|---|---|---|
| 0 | What said. | neutral |
| 1 | plus you have added commercials to the experience... tacky. | positive |
| 2 | I did not today... Must mean I need to take another trip! | neutral |
| 3 | it is really aggressive to blast obnoxious "entertainment" in your guests' faces & they have little recourse | negative |
| 4 | and it is a really big bad thing about it | negative |
| ... | ... | ... |
| 95 | Is it me, or is your website down? BTW, your new website is not a great user experience. Time for another redesign. | negative |
| 96 | I cannot check in or add a bag. Your website is not working. I have tried both desktop and mobile | negative |
| 97 | - Let scanned in passengers leave the plane than told someone to remove their bag from st class bin? #uncomfortable | negative |
| 98 | What is your phone number. I cannot find who to call about a flight reservation. | negative |
| 99 | is anyone doing anything there today? Website is useless and no one is answering the phone. | negative |
100 rows × 2 columns
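The cleaning behavior above can be reproduced standalone on a single made-up tweet; this sketch repeats the same number, URL, and mention patterns (with a simplified URL regex) so it runs on its own:

```python
import re

def clean(text):
    text = re.sub(r"\d+", "", text)             # drop numbers
    text = re.sub(r"http[s]?://\S+", "", text)  # drop URLs (simplified pattern)
    text = re.sub(r"@\w+", "", text)            # drop @mentions
    return text.strip()

sample = "@united flight 123 delayed again http://t.co/abc"
```

Running clean(sample) leaves only the content words, mirroring how the @airline handles disappear from the tweets shown above.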
df_Tweets['text'] = df_Tweets.apply(lambda row: nltk.word_tokenize(row['text']), axis=1) # Tokenization of data
df_Tweets.head(100)
| text | airline_sentiment | |
|---|---|---|
| 0 | [What, said, .] | neutral |
| 1 | [plus, you, have, added, commercials, to, the, experience, ..., tacky, .] | positive |
| 2 | [I, did, not, today, ..., Must, mean, I, need, to, take, another, trip, !] | neutral |
| 3 | [it, is, really, aggressive, to, blast, obnoxious, ``, entertainment, '', in, your, guests, ', faces, &, they, have, little, recourse] | negative |
| 4 | [and, it, is, a, really, big, bad, thing, about, it] | negative |
| ... | ... | ... |
| 95 | [Is, it, me, ,, or, is, your, website, down, ?, BTW, ,, your, new, website, is, not, a, great, user, experience, ., Time, for, another, redesign, .] | negative |
| 96 | [I, can, not, check, in, or, add, a, bag, ., Your, website, is, not, working, ., I, have, tried, both, desktop, and, mobile] | negative |
| 97 | [-, Let, scanned, in, passengers, leave, the, plane, than, told, someone, to, remove, their, bag, from, st, class, bin, ?, #, uncomfortable] | negative |
| 98 | [What, is, your, phone, number, ., I, can, not, find, who, to, call, about, a, flight, reservation, .] | negative |
| 99 | [is, anyone, doing, anything, there, today, ?, Website, is, useless, and, no, one, is, answering, the, phone, .] | negative |
100 rows × 2 columns
import nltk
nltk.download('stopwords')
stopwords = stopwords.words('english')
stopwords = list(set(stopwords))
lemmatizer = WordNetLemmatizer()
# Remove the non-ASCII characters.
def remove_non_ascii(words):
"""Remove non-ASCII characters from list of tokenized words"""
new_words = []
for word in words:
new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
new_words.append(new_word)
return new_words
# Convert all characters to lowercase.
def to_lowercase(words):
"""Convert all characters to lowercase from list of tokenized words"""
new_words = []
for word in words:
new_word = word.lower()
new_words.append(new_word)
return new_words
# Remove the hashtags.
def remove_hash(words):
    """Remove hashtags from a list of tokenized words (note: not called in normalize() below)"""
    new_words = []
    for word in words:
        new_word = re.sub(r'#\w+', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words
# Remove the punctuations.
def remove_punctuation(words):
"""Remove punctuation from list of tokenized words"""
new_words = []
for word in words:
new_word = re.sub(r'[^\w\s]', '', word)
if new_word != '':
new_words.append(new_word)
return new_words
# Remove the stop words.
def remove_stopwords(words):
"""Remove stop words from list of tokenized words"""
new_words = []
for word in words:
if word not in stopwords:
new_words.append(word)
return new_words
# Lemmatize the words.
def lemmatize_list(words):
new_words = []
for word in words:
new_words.append(lemmatizer.lemmatize(word, pos='v'))
return new_words
def normalize(words):
words = remove_non_ascii(words)
words = to_lowercase(words)
words = remove_punctuation(words)
words = remove_stopwords(words)
words = lemmatize_list(words)
return ' '.join(words)
df_Tweets['text'] = df_Tweets.apply(lambda row: normalize(row['text']), axis=1)
df_Tweets.head(100)
| text | airline_sentiment | |
|---|---|---|
| 0 | say | neutral |
| 1 | plus add commercials experience tacky | positive |
| 2 | today must mean need take another trip | neutral |
| 3 | really aggressive blast obnoxious entertainment guests face little recourse | negative |
| 4 | really big bad thing | negative |
| ... | ... | ... |
| 95 | website btw new website great user experience time another redesign | negative |
| 96 | check add bag website work try desktop mobile | negative |
| 97 | let scan passengers leave plane tell someone remove bag st class bin uncomfortable | negative |
| 98 | phone number find call flight reservation | negative |
| 99 | anyone anything today website useless one answer phone | negative |
100 rows × 2 columns
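The normalization pipeline can be sketched end-to-end on one tokenized tweet. This standalone version uses a tiny hand-picked stopword list and skips the WordNet lemmatizer (so "added" stays as-is where the notebook produces "add"); both simplifications are assumptions for illustration:

```python
import re
import unicodedata

STOPWORDS = {"the", "to", "is", "a", "i", "you", "have"}  # tiny illustrative list

def normalize_simple(words):
    """Lowercase, strip non-ASCII characters and punctuation, and drop stopwords."""
    out = []
    for word in words:
        word = unicodedata.normalize("NFKD", word).encode("ascii", "ignore").decode()
        word = word.lower()
        word = re.sub(r"[^\w\s]", "", word)  # strip punctuation
        if word and word not in STOPWORDS:
            out.append(word)
    return " ".join(out)

tokens = ["plus", "you", "have", "added", "commercials", "to",
          "the", "experience", "...", "tacky", "."]
```

Tokens that reduce to the empty string (pure punctuation like "...") are dropped, which is why the normalized tweets above are so much shorter than the originals.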
Observations:
After normalization, each tweet is reduced to lowercase, lemmatized content words, with stopwords, punctuation, numbers, mentions, and URLs removed.
Creating a Word Cloud for all the Tweets.
# Install the WordCloud library.
!pip install wordcloud
# Importing all necessary modules.
from wordcloud import WordCloud
from wordcloud import STOPWORDS
import matplotlib.pyplot as plt
stopword_list = set(STOPWORDS)
word_lists = df_Tweets['text']
unique_str = ' '.join(word_lists)
# Generate the word cloud from the combined text.
word_cloud = WordCloud(width = 3000, height = 2500,
background_color ='white',
stopwords = stopword_list,
min_font_size = 10).generate(unique_str)
# Visualize the WordCloud Plot.
# Set wordcloud figure size.
plt.figure(1,figsize=(12, 12))
# Show image.
plt.imshow(word_cloud)
# Remove Axis.
plt.axis("off")
# Show plot.
plt.show()
Observations:
Words like "flight", "thank", "help", "customer service", "bag", "time", "need", "make", and "fly" are the most frequently used in the overall dataset.
Word Cloud for Negative Tweets.
# Create a dataset of the negative tweets.
negative_tweets=df_Tweets[df_Tweets['airline_sentiment']=='negative']
word_lists_neg = ' '.join(negative_tweets['text'])
# Set the display parameters for the word cloud.
wordcloud = WordCloud(stopwords=stopword_list,
background_color='red',
width=3000,
height=2500
).generate(word_lists_neg)
# Display the word cloud.
plt.figure(1,figsize=(12, 12))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
Observations:
Amongst the negative tweets, the most frequent words are "flight", "time", "help", "bag", "customer service", "go", "cancel", "flightled" (likely an artifact of the scraped text), and "delay".
Word Cloud for Positive Tweets.
# Create a dataset of the positive tweets.
positive_tweets=df_Tweets[df_Tweets['airline_sentiment']=='positive']
word_lists_pos = ' '.join(positive_tweets['text'])
# Set the display parameters for the word cloud.
wordcloud = WordCloud(stopwords=stopword_list,
background_color='green',
width=3000,
height=2500
).generate(word_lists_pos)
# Display the word cloud.
plt.figure(1,figsize=(12, 12))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
Observations:
The positive tweets frequently feature words such as "flight", "great", "thank", "love", "awesome", "appreciate", "customer service", "fly", and "amaze".
Word Cloud for Neutral Tweets.
# Create a dataset of the neutral tweets.
neutral_tweets=df_Tweets[df_Tweets['airline_sentiment']=='neutral']
word_lists_neu = ' '.join(neutral_tweets['text'])
# Set the display parameters for the word cloud.
wordcloud = WordCloud(stopwords=stopword_list,
background_color='orange',
width=3000,
height=2500
).generate(word_lists_neu)
# Display the word cloud.
plt.figure(1,figsize=(12, 12))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
Observations:
The neutral tweets have frequent words such as "flight", "thank", "fly", "need", "please", "go", "ticket" etc.
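Word clouds give a qualitative view of word frequency; a `collections.Counter` over the cleaned text yields the exact counts per sentiment. A sketch, using a hypothetical toy dataframe shaped like the cleaned `df_Tweets` above:

```python
from collections import Counter
import pandas as pd

# Toy stand-in for the cleaned df_Tweets (hypothetical rows).
df = pd.DataFrame({
    'text': ['flight delay bag', 'thank great flight', 'flight cancel delay'],
    'airline_sentiment': ['negative', 'positive', 'negative'],
})

def top_words(df, sentiment, n=3):
    """Return the n most common tokens among tweets of the given sentiment."""
    tokens = ' '.join(df.loc[df['airline_sentiment'] == sentiment, 'text']).split()
    return Counter(tokens).most_common(n)

print(top_words(df, 'negative'))
# -> [('flight', 2), ('delay', 2), ('bag', 1)]
```

Running the same function over the real dataframe would confirm the word-cloud observations numerically.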
# Import Keras libraries.
from keras.models import Model
from keras.layers import Input, LSTM, GRU, Dense, Embedding
from keras.preprocessing.text import Tokenizer
from sklearn.preprocessing import LabelBinarizer
from keras.layers import Activation, Dropout
from keras.models import Sequential
df_Tweets_K = df_Tweets_Orig.copy()
# Size of our dataset.
df_Tweets_K.shape
(14640, 15)
# Look at the first 25 rows
df_Tweets_K.head(25)
| | tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 570306133677760513 | neutral | 1.0000 | NaN | NaN | Virgin America | NaN | cairdin | NaN | 0 | @VirginAmerica What @dhepburn said. | NaN | 2015-02-24 11:35:52 -0800 | NaN | Eastern Time (US & Canada) |
| 1 | 570301130888122368 | positive | 0.3486 | NaN | 0.0000 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica plus you've added commercials to the experience... tacky. | NaN | 2015-02-24 11:15:59 -0800 | NaN | Pacific Time (US & Canada) |
| 2 | 570301083672813571 | neutral | 0.6837 | NaN | NaN | Virgin America | NaN | yvonnalynn | NaN | 0 | @VirginAmerica I didn't today... Must mean I need to take another trip! | NaN | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 3 | 570301031407624196 | negative | 1.0000 | Bad Flight | 0.7033 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces & they have little recourse | NaN | 2015-02-24 11:15:36 -0800 | NaN | Pacific Time (US & Canada) |
| 4 | 570300817074462722 | negative | 1.0000 | Can't Tell | 1.0000 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica and it's a really big bad thing about it | NaN | 2015-02-24 11:14:45 -0800 | NaN | Pacific Time (US & Canada) |
| 5 | 570300767074181121 | negative | 1.0000 | Can't Tell | 0.6842 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\nit's really the only bad thing about flying VA | NaN | 2015-02-24 11:14:33 -0800 | NaN | Pacific Time (US & Canada) |
| 6 | 570300616901320704 | positive | 0.6745 | NaN | 0.0000 | Virgin America | NaN | cjmcginnis | NaN | 0 | @VirginAmerica yes, nearly every time I fly VX this “ear worm” won’t go away :) | NaN | 2015-02-24 11:13:57 -0800 | San Francisco CA | Pacific Time (US & Canada) |
| 7 | 570300248553349120 | neutral | 0.6340 | NaN | NaN | Virgin America | NaN | pilot | NaN | 0 | @VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP | NaN | 2015-02-24 11:12:29 -0800 | Los Angeles | Pacific Time (US & Canada) |
| 8 | 570299953286942721 | positive | 0.6559 | NaN | NaN | Virgin America | NaN | dhepburn | NaN | 0 | @virginamerica Well, I didn't…but NOW I DO! :-D | NaN | 2015-02-24 11:11:19 -0800 | San Diego | Pacific Time (US & Canada) |
| 9 | 570295459631263746 | positive | 1.0000 | NaN | NaN | Virgin America | NaN | YupitsTate | NaN | 0 | @VirginAmerica it was amazing, and arrived an hour early. You're too good to me. | NaN | 2015-02-24 10:53:27 -0800 | Los Angeles | Eastern Time (US & Canada) |
| 10 | 570294189143031808 | neutral | 0.6769 | NaN | 0.0000 | Virgin America | NaN | idk_but_youtube | NaN | 0 | @VirginAmerica did you know that suicide is the second leading cause of death among teens 10-24 | NaN | 2015-02-24 10:48:24 -0800 | 1/1 loner squad | Eastern Time (US & Canada) |
| 11 | 570289724453216256 | positive | 1.0000 | NaN | NaN | Virgin America | NaN | HyperCamiLax | NaN | 0 | @VirginAmerica I <3 pretty graphics. so much better than minimal iconography. :D | NaN | 2015-02-24 10:30:40 -0800 | NYC | America/New_York |
| 12 | 570289584061480960 | positive | 1.0000 | NaN | NaN | Virgin America | NaN | HyperCamiLax | NaN | 0 | @VirginAmerica This is such a great deal! Already thinking about my 2nd trip to @Australia & I haven't even gone on my 1st trip yet! ;p | NaN | 2015-02-24 10:30:06 -0800 | NYC | America/New_York |
| 13 | 570287408438120448 | positive | 0.6451 | NaN | NaN | Virgin America | NaN | mollanderson | NaN | 0 | @VirginAmerica @virginmedia I'm flying your #fabulous #Seductive skies again! U take all the #stress away from travel http://t.co/ahlXHhKiyn | NaN | 2015-02-24 10:21:28 -0800 | NaN | Eastern Time (US & Canada) |
| 14 | 570285904809598977 | positive | 1.0000 | NaN | NaN | Virgin America | NaN | sjespers | NaN | 0 | @VirginAmerica Thanks! | NaN | 2015-02-24 10:15:29 -0800 | San Francisco, CA | Pacific Time (US & Canada) |
| 15 | 570282469121007616 | negative | 0.6842 | Late Flight | 0.3684 | Virgin America | NaN | smartwatermelon | NaN | 0 | @VirginAmerica SFO-PDX schedule is still MIA. | NaN | 2015-02-24 10:01:50 -0800 | palo alto, ca | Pacific Time (US & Canada) |
| 16 | 570277724385734656 | positive | 1.0000 | NaN | NaN | Virgin America | NaN | ItzBrianHunty | NaN | 0 | @VirginAmerica So excited for my first cross country flight LAX to MCO I've heard nothing but great things about Virgin America. #29DaysToGo | NaN | 2015-02-24 09:42:59 -0800 | west covina | Pacific Time (US & Canada) |
| 17 | 570276917301137409 | negative | 1.0000 | Bad Flight | 1.0000 | Virgin America | NaN | heatherovieda | NaN | 0 | @VirginAmerica I flew from NYC to SFO last week and couldn't fully sit in my seat due to two large gentleman on either side of me. HELP! | NaN | 2015-02-24 09:39:46 -0800 | this place called NYC | Eastern Time (US & Canada) |
| 18 | 570270684619923457 | positive | 1.0000 | NaN | NaN | Virgin America | NaN | thebrandiray | NaN | 0 | I ❤️ flying @VirginAmerica. ☺️👍 | NaN | 2015-02-24 09:15:00 -0800 | Somewhere celebrating life. | Atlantic Time (Canada) |
| 19 | 570267956648792064 | positive | 1.0000 | NaN | NaN | Virgin America | NaN | JNLpierce | NaN | 0 | @VirginAmerica you know what would be amazingly awesome? BOS-FLL PLEASE!!!!!!! I want to fly with only you. | NaN | 2015-02-24 09:04:10 -0800 | Boston | Waltham | Quito |
| 20 | 570265883513384960 | negative | 0.6705 | Can't Tell | 0.3614 | Virgin America | NaN | MISSGJ | NaN | 0 | @VirginAmerica why are your first fares in May over three times more than other carriers when all seats are available to select??? | NaN | 2015-02-24 08:55:56 -0800 | NaN | NaN |
| 21 | 570264145116819457 | positive | 1.0000 | NaN | NaN | Virgin America | NaN | DT_Les | NaN | 0 | @VirginAmerica I love this graphic. http://t.co/UT5GrRwAaA | [40.74804263, -73.99295302] | 2015-02-24 08:49:01 -0800 | NaN | NaN |
| 22 | 570259420287868928 | positive | 1.0000 | NaN | NaN | Virgin America | NaN | ElvinaBeck | NaN | 0 | @VirginAmerica I love the hipster innovation. You are a feel good brand. | NaN | 2015-02-24 08:30:15 -0800 | Los Angeles | Pacific Time (US & Canada) |
| 23 | 570258822297579520 | neutral | 1.0000 | NaN | NaN | Virgin America | NaN | rjlynch21086 | NaN | 0 | @VirginAmerica will you be making BOS>LAS non stop permanently anytime soon? | NaN | 2015-02-24 08:27:52 -0800 | Boston, MA | Eastern Time (US & Canada) |
| 24 | 570256553502068736 | negative | 1.0000 | Customer Service Issue | 0.3557 | Virgin America | NaN | ayeevickiee | NaN | 0 | @VirginAmerica you guys messed up my seating.. I reserved seating with my friends and you guys gave my seat away ... 😡 I want free internet | NaN | 2015-02-24 08:18:51 -0800 | 714 | Mountain Time (US & Canada) |
Dividing the Keras dataset into train and test.
# Before we feed our data into the Keras model, we extract the features and labels, and later divide them into training and test sets.
X = df_Tweets_K.iloc[:, 10].values
y = df_Tweets_K.iloc[:, 1].values
# The "iloc" indexer selects columns by position.
# Our feature set is the tweet text, which is the 11th column; since column
# indices start at 0, we pass 10 as the column index.
# Similarly, for the labels we pass index 1 (airline_sentiment).
# Variable X now contains our feature set, while y contains the labels.
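The positional selection can be illustrated on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'tweet_id': [1, 2],
                   'airline_sentiment': ['neutral', 'positive'],
                   'text': ['hello', 'great flight']})

# iloc[:, n] selects the (n+1)-th column by position, independent of its name.
labels = df.iloc[:, 1].values  # airline_sentiment column
texts = df.iloc[:, 2].values   # text column
print(list(labels), list(texts))
# -> ['neutral', 'positive'] ['hello', 'great flight']
```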
Text cleaning.
# Remove the HTML tags.
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()
# Expand the contractions.
def replace_contractions(text):
    """Replace contractions in string of text"""
    return contractions.fix(text)
# Remove the numerals present in the text.
def remove_numbers(text):
    return re.sub(r'\d+', '', text)
# Remove the URLs present in the text.
def remove_url(text):
    return re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)
# Remove the mentions in the tweets.
def remove_mention(text):
    return re.sub(r'@\w+', '', text)
# Apply all of the cleaning steps in sequence.
def clean_text(text):
    text = strip_html(text)
    text = replace_contractions(text)
    text = remove_numbers(text)
    text = remove_url(text)
    text = remove_mention(text)
    return text
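The regex-based steps can be exercised on a sample tweet. A self-contained sketch, omitting the BeautifulSoup and contractions steps (which need third-party packages) and using a simplified URL pattern:

```python
import re

def clean_sketch(text):
    text = re.sub(r'http\S+', '', text)  # drop URLs (simplified pattern)
    text = re.sub(r'@\w+', '', text)     # drop @mentions
    text = re.sub(r'\d+', '', text)      # drop numbers
    return text.strip()

print(clean_sketch('@VirginAmerica flight 123 delayed http://t.co/abc'))
# -> 'flight  delayed'
```

Note the leftover double space where "123" was removed; the later tokenization step absorbs such gaps.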
df_Tweets_K['text'] = df_Tweets_K['text'].apply(lambda x: clean_text(x))
df_Tweets_K.head(100)
| | tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 570306133677760513 | neutral | 1.0000 | NaN | NaN | Virgin America | NaN | cairdin | NaN | 0 | What said. | NaN | 2015-02-24 11:35:52 -0800 | NaN | Eastern Time (US & Canada) |
| 1 | 570301130888122368 | positive | 0.3486 | NaN | 0.0000 | Virgin America | NaN | jnardino | NaN | 0 | plus you have added commercials to the experience... tacky. | NaN | 2015-02-24 11:15:59 -0800 | NaN | Pacific Time (US & Canada) |
| 2 | 570301083672813571 | neutral | 0.6837 | NaN | NaN | Virgin America | NaN | yvonnalynn | NaN | 0 | I did not today... Must mean I need to take another trip! | NaN | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 3 | 570301031407624196 | negative | 1.0000 | Bad Flight | 0.7033 | Virgin America | NaN | jnardino | NaN | 0 | it is really aggressive to blast obnoxious "entertainment" in your guests' faces & they have little recourse | NaN | 2015-02-24 11:15:36 -0800 | NaN | Pacific Time (US & Canada) |
| 4 | 570300817074462722 | negative | 1.0000 | Can't Tell | 1.0000 | Virgin America | NaN | jnardino | NaN | 0 | and it is a really big bad thing about it | NaN | 2015-02-24 11:14:45 -0800 | NaN | Pacific Time (US & Canada) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 569910981868060673 | negative | 1.0000 | Customer Service Issue | 0.6863 | Virgin America | NaN | MerchEngines | NaN | 0 | Is it me, or is your website down? BTW, your new website is not a great user experience. Time for another redesign. | NaN | 2015-02-23 09:25:41 -0800 | Los Angeles, CA | Arizona |
| 96 | 569909224521641984 | negative | 1.0000 | Customer Service Issue | 0.6771 | Virgin America | NaN | ColorCartel | NaN | 0 | I cannot check in or add a bag. Your website is not working. I have tried both desktop and mobile | NaN | 2015-02-23 09:18:42 -0800 | Austin, TX | Mountain Time (US & Canada) |
| 97 | 569907336485019648 | negative | 1.0000 | Can't Tell | 0.6590 | Virgin America | NaN | MustBeSpoken | NaN | 0 | - Let scanned in passengers leave the plane than told someone to remove their bag from st class bin? #uncomfortable | NaN | 2015-02-23 09:11:12 -0800 | NaN | NaN |
| 98 | 569896805611089920 | negative | 1.0000 | Flight Booking Problems | 0.6714 | Virgin America | NaN | mattbunk | NaN | 0 | What is your phone number. I cannot find who to call about a flight reservation. | NaN | 2015-02-23 08:29:21 -0800 | Sterling Heights, MI | Eastern Time (US & Canada) |
| 99 | 569894449620369408 | negative | 1.0000 | Customer Service Issue | 1.0000 | Virgin America | NaN | louisjenny | NaN | 0 | is anyone doing anything there today? Website is useless and no one is answering the phone. | NaN | 2015-02-23 08:19:59 -0800 | Washington DC | Quito |
100 rows × 15 columns
df_Tweets_K['text'] = df_Tweets_K.apply(lambda row: nltk.word_tokenize(row['text']), axis=1) # Tokenization of the dataset.
df_Tweets_K.head(100)
| | tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 570306133677760513 | neutral | 1.0000 | NaN | NaN | Virgin America | NaN | cairdin | NaN | 0 | [What, said, .] | NaN | 2015-02-24 11:35:52 -0800 | NaN | Eastern Time (US & Canada) |
| 1 | 570301130888122368 | positive | 0.3486 | NaN | 0.0000 | Virgin America | NaN | jnardino | NaN | 0 | [plus, you, have, added, commercials, to, the, experience, ..., tacky, .] | NaN | 2015-02-24 11:15:59 -0800 | NaN | Pacific Time (US & Canada) |
| 2 | 570301083672813571 | neutral | 0.6837 | NaN | NaN | Virgin America | NaN | yvonnalynn | NaN | 0 | [I, did, not, today, ..., Must, mean, I, need, to, take, another, trip, !] | NaN | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 3 | 570301031407624196 | negative | 1.0000 | Bad Flight | 0.7033 | Virgin America | NaN | jnardino | NaN | 0 | [it, is, really, aggressive, to, blast, obnoxious, ``, entertainment, '', in, your, guests, ', faces, &, they, have, little, recourse] | NaN | 2015-02-24 11:15:36 -0800 | NaN | Pacific Time (US & Canada) |
| 4 | 570300817074462722 | negative | 1.0000 | Can't Tell | 1.0000 | Virgin America | NaN | jnardino | NaN | 0 | [and, it, is, a, really, big, bad, thing, about, it] | NaN | 2015-02-24 11:14:45 -0800 | NaN | Pacific Time (US & Canada) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 569910981868060673 | negative | 1.0000 | Customer Service Issue | 0.6863 | Virgin America | NaN | MerchEngines | NaN | 0 | [Is, it, me, ,, or, is, your, website, down, ?, BTW, ,, your, new, website, is, not, a, great, user, experience, ., Time, for, another, redesign, .] | NaN | 2015-02-23 09:25:41 -0800 | Los Angeles, CA | Arizona |
| 96 | 569909224521641984 | negative | 1.0000 | Customer Service Issue | 0.6771 | Virgin America | NaN | ColorCartel | NaN | 0 | [I, can, not, check, in, or, add, a, bag, ., Your, website, is, not, working, ., I, have, tried, both, desktop, and, mobile] | NaN | 2015-02-23 09:18:42 -0800 | Austin, TX | Mountain Time (US & Canada) |
| 97 | 569907336485019648 | negative | 1.0000 | Can't Tell | 0.6590 | Virgin America | NaN | MustBeSpoken | NaN | 0 | [-, Let, scanned, in, passengers, leave, the, plane, than, told, someone, to, remove, their, bag, from, st, class, bin, ?, #, uncomfortable] | NaN | 2015-02-23 09:11:12 -0800 | NaN | NaN |
| 98 | 569896805611089920 | negative | 1.0000 | Flight Booking Problems | 0.6714 | Virgin America | NaN | mattbunk | NaN | 0 | [What, is, your, phone, number, ., I, can, not, find, who, to, call, about, a, flight, reservation, .] | NaN | 2015-02-23 08:29:21 -0800 | Sterling Heights, MI | Eastern Time (US & Canada) |
| 99 | 569894449620369408 | negative | 1.0000 | Customer Service Issue | 1.0000 | Virgin America | NaN | louisjenny | NaN | 0 | [is, anyone, doing, anything, there, today, ?, Website, is, useless, and, no, one, is, answering, the, phone, .] | NaN | 2015-02-23 08:19:59 -0800 | Washington DC | Quito |
100 rows × 15 columns
# Tweets contain many special characters such as @, #, – etc. Similarly, there can be many empty spaces.
# These special characters and empty spaces normally do not help in classification,
# therefore we clean our text before using it for deep learning purposes.
# The following script performs text cleaning tasks.
df_Tweets_K['text'] = df_Tweets_K.apply(lambda row: normalize(row['text']), axis=1)
# The normalize() pipeline defined earlier removes non-ASCII characters,
# lowercases the tokens, strips punctuation and stop words, lemmatizes the
# remaining words, and joins them back into a single string.
# Divide our data into a training and test set.
# Note: X and y were extracted before cleaning, so we re-extract them from the
# cleaned dataframe to make sure the model trains on the normalized text.
X = df_Tweets_K['text'].values
y = df_Tweets_K['airline_sentiment'].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Converting Text to Numbers:
Statistical approaches such as machine learning and deep learning work with numbers, but our data is text, so we must convert it into numeric form. Several schemes exist for this, such as bag of words, TF-IDF, and word2vec.
To convert text to numbers, we use the `Tokenizer` class from the `keras.preprocessing.text` library. Its constructor takes a `num_words` parameter, which caps the vocabulary at the `num_words` most frequently occurring words; words rarer than this cutoff are generally not very helpful for classification. We first call the `fit_on_texts()` method to build the vocabulary from the training data, then `texts_to_matrix()` to convert each text into a fixed-width numeric row. The `mode` parameter specifies the weighting scheme; we use the TF-IDF scheme owing to its simplicity and effectiveness. The following script converts text to numbers.
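Conceptually, the tokenizer builds a word-to-index vocabulary of the `num_words` most frequent words, then maps each document onto a row vector of that width. A minimal pure-Python sketch of the idea (using count weighting rather than TF-IDF, for brevity; not the Keras API itself):

```python
from collections import Counter

def fit_vocab(texts, num_words):
    """Index the num_words most frequent words, most frequent first."""
    counts = Counter(w for t in texts for w in t.split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(num_words))}

def to_matrix(texts, vocab):
    """One row per document, one column per vocabulary word (count mode)."""
    rows = []
    for t in texts:
        row = [0] * len(vocab)
        for w in t.split():
            if w in vocab:
                row[vocab[w]] += 1
        rows.append(row)
    return rows

docs = ['flight delay flight', 'thank flight']
vocab = fit_vocab(docs, num_words=3)
print(to_matrix(docs, vocab))
# -> [[2, 1, 0], [1, 0, 1]]
```

Keras' `texts_to_matrix(..., mode='tfidf')` additionally reweights these counts by inverse document frequency, down-weighting words that appear in most tweets.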
vocab_size = 1000
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(X_train)
train_tweets = tokenizer.texts_to_matrix(X_train, mode='tfidf')
test_tweets = tokenizer.texts_to_matrix(X_test, mode='tfidf')
# Our labels are also text: positive, neutral and negative.
# We need to convert them into numbers as well.
# To do so, we use LabelBinarizer from the sklearn.preprocessing library,
# which one-hot encodes the three classes.
encoder = LabelBinarizer()
encoder.fit(y_train)
train_sentiment = encoder.transform(y_train)
test_sentiment = encoder.transform(y_test)
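For the three sentiment classes, `LabelBinarizer` sorts the class names and produces one row of one-hot indicators per label, for example:

```python
from sklearn.preprocessing import LabelBinarizer

encoder = LabelBinarizer()
encoder.fit(['negative', 'neutral', 'positive'])
print(list(encoder.classes_))           # ['negative', 'neutral', 'positive']
print(encoder.transform(['positive']))  # [[0 0 1]]
```

The three output columns match the three units in the final `Dense(3)` softmax layer below.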
# Set the neural network parameters and hyperparameters.
model = Sequential()
model.add(Dense(512, input_shape=(vocab_size,)))
model.add(Activation('relu'))
model.add(Dropout(0.3))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.3))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.3))
model.add(Dense(3))
model.add(Activation('softmax'))
model.summary()
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model_info = model.fit(train_tweets, train_sentiment,
batch_size=256,
epochs=100,
verbose=1,
validation_split=0.1)
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense (Dense) (None, 512) 512512 _________________________________________________________________ activation (Activation) (None, 512) 0 _________________________________________________________________ dropout (Dropout) (None, 512) 0 _________________________________________________________________ dense_1 (Dense) (None, 512) 262656 _________________________________________________________________ activation_1 (Activation) (None, 512) 0 _________________________________________________________________ dropout_1 (Dropout) (None, 512) 0 _________________________________________________________________ dense_2 (Dense) (None, 512) 262656 _________________________________________________________________ activation_2 (Activation) (None, 512) 0 _________________________________________________________________ dropout_2 (Dropout) (None, 512) 0 _________________________________________________________________ dense_3 (Dense) (None, 3) 1539 _________________________________________________________________ activation_3 (Activation) (None, 3) 0 ================================================================= Total params: 1,039,363 Trainable params: 1,039,363 Non-trainable params: 0 _________________________________________________________________ Epoch 1/100 42/42 [==============================] - 2s 44ms/step - loss: 0.7834 - accuracy: 0.6556 - val_loss: 0.5722 - val_accuracy: 0.7782 Epoch 2/100 42/42 [==============================] - 2s 40ms/step - loss: 0.5163 - accuracy: 0.7969 - val_loss: 0.5571 - val_accuracy: 0.7679 Epoch 3/100 42/42 [==============================] - 2s 39ms/step - loss: 0.3942 - accuracy: 0.8444 - val_loss: 0.6011 - val_accuracy: 0.7645 Epoch 4/100 42/42 [==============================] - 2s 39ms/step - loss: 0.3055 - accuracy: 0.8790 - val_loss: 0.6472 - 
val_accuracy: 0.7611 Epoch 5/100 42/42 [==============================] - 2s 40ms/step - loss: 0.2324 - accuracy: 0.9103 - val_loss: 0.7580 - val_accuracy: 0.7662 Epoch 6/100 42/42 [==============================] - 2s 39ms/step - loss: 0.1905 - accuracy: 0.9259 - val_loss: 0.8893 - val_accuracy: 0.7611 Epoch 7/100 42/42 [==============================] - 2s 39ms/step - loss: 0.1445 - accuracy: 0.9466 - val_loss: 0.9553 - val_accuracy: 0.7662 Epoch 8/100 42/42 [==============================] - 2s 41ms/step - loss: 0.1227 - accuracy: 0.9552 - val_loss: 1.0535 - val_accuracy: 0.7688 Epoch 9/100 42/42 [==============================] - 2s 38ms/step - loss: 0.0981 - accuracy: 0.9633 - val_loss: 1.2195 - val_accuracy: 0.7560 Epoch 10/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0921 - accuracy: 0.9669 - val_loss: 1.2681 - val_accuracy: 0.7406 Epoch 11/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0901 - accuracy: 0.9665 - val_loss: 1.3274 - val_accuracy: 0.7474 Epoch 12/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0783 - accuracy: 0.9700 - val_loss: 1.4090 - val_accuracy: 0.7551 Epoch 13/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0812 - accuracy: 0.9713 - val_loss: 1.2659 - val_accuracy: 0.7526 Epoch 14/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0685 - accuracy: 0.9739 - val_loss: 1.4992 - val_accuracy: 0.7491 Epoch 15/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0741 - accuracy: 0.9739 - val_loss: 1.3658 - val_accuracy: 0.7526 Epoch 16/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0627 - accuracy: 0.9783 - val_loss: 1.4421 - val_accuracy: 0.7534 Epoch 17/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0571 - accuracy: 0.9799 - val_loss: 1.5812 - val_accuracy: 0.7637 Epoch 18/100 42/42 [==============================] - 2s 40ms/step - loss: 0.0567 - accuracy: 0.9797 - val_loss: 1.5937 - 
val_accuracy: 0.7526 Epoch 19/100 42/42 [==============================] - 2s 38ms/step - loss: 0.0582 - accuracy: 0.9778 - val_loss: 1.5950 - val_accuracy: 0.7637 Epoch 20/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0588 - accuracy: 0.9787 - val_loss: 1.6289 - val_accuracy: 0.7466 Epoch 21/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0628 - accuracy: 0.9775 - val_loss: 1.4828 - val_accuracy: 0.7602 Epoch 22/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0557 - accuracy: 0.9796 - val_loss: 1.5347 - val_accuracy: 0.7662 Epoch 23/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0600 - accuracy: 0.9773 - val_loss: 1.6384 - val_accuracy: 0.7543 Epoch 24/100 42/42 [==============================] - 2s 40ms/step - loss: 0.0518 - accuracy: 0.9830 - val_loss: 1.6749 - val_accuracy: 0.7585 Epoch 25/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0497 - accuracy: 0.9809 - val_loss: 1.6974 - val_accuracy: 0.7500 Epoch 26/100 42/42 [==============================] - 2s 38ms/step - loss: 0.0477 - accuracy: 0.9822 - val_loss: 1.8918 - val_accuracy: 0.7509 Epoch 27/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0520 - accuracy: 0.9802 - val_loss: 1.7418 - val_accuracy: 0.7551 Epoch 28/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0498 - accuracy: 0.9814 - val_loss: 1.8283 - val_accuracy: 0.7637 Epoch 29/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0478 - accuracy: 0.9817 - val_loss: 1.7238 - val_accuracy: 0.7491 Epoch 30/100 42/42 [==============================] - 2s 40ms/step - loss: 0.0476 - accuracy: 0.9809 - val_loss: 1.8596 - val_accuracy: 0.7509 Epoch 31/100 42/42 [==============================] - 2s 40ms/step - loss: 0.0513 - accuracy: 0.9797 - val_loss: 1.8395 - val_accuracy: 0.7526 Epoch 32/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0507 - accuracy: 0.9815 - val_loss: 1.8410 
- val_accuracy: 0.7534 Epoch 33/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0467 - accuracy: 0.9821 - val_loss: 1.9732 - val_accuracy: 0.7585 Epoch 34/100 42/42 [==============================] - 2s 40ms/step - loss: 0.0507 - accuracy: 0.9809 - val_loss: 1.7967 - val_accuracy: 0.7560 Epoch 35/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0430 - accuracy: 0.9831 - val_loss: 2.0053 - val_accuracy: 0.7577 Epoch 36/100 42/42 [==============================] - 2s 41ms/step - loss: 0.0444 - accuracy: 0.9824 - val_loss: 2.0215 - val_accuracy: 0.7577 Epoch 37/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0460 - accuracy: 0.9822 - val_loss: 1.9831 - val_accuracy: 0.7500 Epoch 38/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0451 - accuracy: 0.9821 - val_loss: 1.9429 - val_accuracy: 0.7534 Epoch 39/100 42/42 [==============================] - 2s 40ms/step - loss: 0.0450 - accuracy: 0.9830 - val_loss: 1.9390 - val_accuracy: 0.7500 Epoch 40/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0404 - accuracy: 0.9836 - val_loss: 2.0201 - val_accuracy: 0.7560 Epoch 41/100 42/42 [==============================] - 2s 40ms/step - loss: 0.0401 - accuracy: 0.9838 - val_loss: 2.0830 - val_accuracy: 0.7568 Epoch 42/100 42/42 [==============================] - 2s 40ms/step - loss: 0.0418 - accuracy: 0.9835 - val_loss: 2.1208 - val_accuracy: 0.7577 Epoch 43/100 42/42 [==============================] - 2s 40ms/step - loss: 0.0402 - accuracy: 0.9849 - val_loss: 2.1477 - val_accuracy: 0.7568 Epoch 44/100 42/42 [==============================] - 2s 40ms/step - loss: 0.0409 - accuracy: 0.9839 - val_loss: 2.1973 - val_accuracy: 0.7628 Epoch 45/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0391 - accuracy: 0.9843 - val_loss: 2.2452 - val_accuracy: 0.7568 Epoch 46/100 42/42 [==============================] - 2s 40ms/step - loss: 0.0393 - accuracy: 0.9837 - val_loss: 
2.1677 - val_accuracy: 0.7611 Epoch 47/100 42/42 [==============================] - 2s 40ms/step - loss: 0.0408 - accuracy: 0.9836 - val_loss: 2.1203 - val_accuracy: 0.7602 Epoch 48/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0431 - accuracy: 0.9831 - val_loss: 2.0539 - val_accuracy: 0.7585 Epoch 49/100 42/42 [==============================] - 2s 40ms/step - loss: 0.0430 - accuracy: 0.9830 - val_loss: 2.1900 - val_accuracy: 0.7637 Epoch 50/100 42/42 [==============================] - 2s 40ms/step - loss: 0.0445 - accuracy: 0.9831 - val_loss: 2.1696 - val_accuracy: 0.7534 Epoch 51/100 42/42 [==============================] - 2s 40ms/step - loss: 0.0400 - accuracy: 0.9831 - val_loss: 2.2476 - val_accuracy: 0.7577 Epoch 52/100 42/42 [==============================] - 2s 40ms/step - loss: 0.0426 - accuracy: 0.9831 - val_loss: 2.1742 - val_accuracy: 0.7585 Epoch 53/100 42/42 [==============================] - 2s 40ms/step - loss: 0.0372 - accuracy: 0.9851 - val_loss: 2.2764 - val_accuracy: 0.7585 Epoch 54/100 42/42 [==============================] - 2s 40ms/step - loss: 0.0400 - accuracy: 0.9842 - val_loss: 2.2031 - val_accuracy: 0.7491 Epoch 55/100 42/42 [==============================] - 2s 40ms/step - loss: 0.0394 - accuracy: 0.9833 - val_loss: 2.2169 - val_accuracy: 0.7568 Epoch 56/100 42/42 [==============================] - 2s 40ms/step - loss: 0.0349 - accuracy: 0.9850 - val_loss: 2.3895 - val_accuracy: 0.7585 Epoch 57/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0380 - accuracy: 0.9849 - val_loss: 2.4539 - val_accuracy: 0.7534 Epoch 58/100 42/42 [==============================] - 2s 40ms/step - loss: 0.0428 - accuracy: 0.9830 - val_loss: 2.4014 - val_accuracy: 0.7543 Epoch 59/100 42/42 [==============================] - 2s 39ms/step - loss: 0.0423 - accuracy: 0.9842 - val_loss: 2.2416 - val_accuracy: 0.7517 Epoch 60/100 42/42 [==============================] - 2s 40ms/step - loss: 0.0419 - accuracy: 0.9832 - 
val_loss: 2.2194 - val_accuracy: 0.7637
Epoch 61/100 42/42 [==============================] - 2s 40ms/step - loss: 0.0392 - accuracy: 0.9844 - val_loss: 2.2752 - val_accuracy: 0.7534
... (epochs 62-99: training accuracy holds near 0.985 while validation accuracy oscillates around 0.75 and validation loss climbs) ...
Epoch 100/100 42/42 [==============================] - 2s 40ms/step - loss: 0.0342 - accuracy: 0.9852 - val_loss: 2.7432 - val_accuracy: 0.7560
Evaluating the algorithm.
# As the last step, we evaluate the performance of our algorithm on the test set using the following script:
result = model.evaluate(test_tweets, test_sentiment,
batch_size=256, verbose=1)
print('Test accuracy:', result[1])
12/12 [==============================] - 0s 13ms/step - loss: 2.7632 - accuracy: 0.7766 Test accuracy: 0.7766393423080444
Observations:
The model performs weakly on both validation and test data while strongly overfitting the training set (training accuracy near 98% against roughly 75-78% validation and test accuracy). To further improve accuracy, we could try a different number of layers, dropout rates, numbers of epochs, or activation functions.
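Another common remedy for this kind of overfitting is early stopping: halt training once the validation loss stops improving. Keras provides this via the `EarlyStopping` callback; the underlying patience logic can be sketched in plain Python (the validation-loss values below are made up for illustration):

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop,
    i.e. `patience` epochs after the best validation loss so far."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch = loss, epoch          # new best: reset the patience counter
        elif epoch - best_epoch >= patience:
            return epoch                            # no improvement for `patience` epochs
    return len(val_losses)                          # ran to completion

# hypothetical validation losses: improving at first, then diverging
print(early_stop_epoch([0.75, 0.63, 0.60, 0.62, 0.66, 0.71]))  # → 6
```

In Keras the equivalent is passing `tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)` in the `callbacks` list of `model.fit`, which would have halted the runs above long before epoch 100.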
Using TfidfVectorizer to prepare the dataset for the Naive Bayes classifier model.
# using Tfidf Vectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features = 2000)
x_train_vec = vectorizer.fit_transform(X_train)
x_test_vec = vectorizer.transform(X_test)
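As a quick sanity check, here is what the vectorizer produces on a toy corpus (the three "tweets" are made up). Note the same pattern as above: fit on the training text only, so that test-set vocabulary never leaks into the features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["late flight again", "great service", "late again"]  # hypothetical mini-corpus
vec = TfidfVectorizer(max_features=2000)
X = vec.fit_transform(corpus)  # sparse matrix: one row per document, one column per term

print(sorted(vec.vocabulary_))  # → ['again', 'flight', 'great', 'late', 'service']
print(X.shape)                  # → (3, 5)
```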
Fitting a Multinomial Naive Bayes classifier and checking its accuracy on the training set.
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(x_train_vec, y_train)
y_pred = clf.predict(x_test_vec)
print(clf.score(x_train_vec, y_train))
0.7680157103825137
The training accuracy of 76.81% is not great, but it is in line with the test accuracy of the immediately preceding neural network fed by the Keras-prepared dataset.
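Note that `clf.score(x_train_vec, y_train)` above measures accuracy on the training data; the held-out figure comes from scoring the test split instead. A minimal, self-contained sketch of that pattern (the labeled tweets below are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# toy labeled tweets, made up for illustration
texts = ["late flight so annoyed", "great crew thank you", "lost my bag again",
         "love this airline", "worst delay ever", "smooth flight and friendly staff"] * 5
labels = ["negative", "positive", "negative", "positive", "negative", "positive"] * 5

tr_x, te_x, tr_y, te_y = train_test_split(texts, labels, test_size=0.33, random_state=0)

vec = TfidfVectorizer(max_features=2000)
clf = MultinomialNB().fit(vec.fit_transform(tr_x), tr_y)   # fit vectorizer on train only

train_acc = clf.score(vec.transform(tr_x), tr_y)
test_acc = clf.score(vec.transform(te_x), te_y)            # the held-out number to report
```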
# Create and write the results to a local .csv file.
np.savetxt("Predictions_twitter_sentiments.csv", y_pred, fmt="%s")
# Make a copy of the original dataframe to use specifically for the Tensorflow data pre-processing.
df_Tweets_TF = df_Tweets_Orig.copy()
# Look at the first 25 rows.
df_Tweets_TF.head(25)
| | tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 570306133677760513 | neutral | 1.0000 | NaN | NaN | Virgin America | NaN | cairdin | NaN | 0 | @VirginAmerica What @dhepburn said. | NaN | 2015-02-24 11:35:52 -0800 | NaN | Eastern Time (US & Canada) |
| 1 | 570301130888122368 | positive | 0.3486 | NaN | 0.0000 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica plus you've added commercials to the experience... tacky. | NaN | 2015-02-24 11:15:59 -0800 | NaN | Pacific Time (US & Canada) |
| 2 | 570301083672813571 | neutral | 0.6837 | NaN | NaN | Virgin America | NaN | yvonnalynn | NaN | 0 | @VirginAmerica I didn't today... Must mean I need to take another trip! | NaN | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 3 | 570301031407624196 | negative | 1.0000 | Bad Flight | 0.7033 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces & they have little recourse | NaN | 2015-02-24 11:15:36 -0800 | NaN | Pacific Time (US & Canada) |
| 4 | 570300817074462722 | negative | 1.0000 | Can't Tell | 1.0000 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica and it's a really big bad thing about it | NaN | 2015-02-24 11:14:45 -0800 | NaN | Pacific Time (US & Canada) |
| 5 | 570300767074181121 | negative | 1.0000 | Can't Tell | 0.6842 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\nit's really the only bad thing about flying VA | NaN | 2015-02-24 11:14:33 -0800 | NaN | Pacific Time (US & Canada) |
| 6 | 570300616901320704 | positive | 0.6745 | NaN | 0.0000 | Virgin America | NaN | cjmcginnis | NaN | 0 | @VirginAmerica yes, nearly every time I fly VX this “ear worm” won’t go away :) | NaN | 2015-02-24 11:13:57 -0800 | San Francisco CA | Pacific Time (US & Canada) |
| 7 | 570300248553349120 | neutral | 0.6340 | NaN | NaN | Virgin America | NaN | pilot | NaN | 0 | @VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP | NaN | 2015-02-24 11:12:29 -0800 | Los Angeles | Pacific Time (US & Canada) |
| 8 | 570299953286942721 | positive | 0.6559 | NaN | NaN | Virgin America | NaN | dhepburn | NaN | 0 | @virginamerica Well, I didn't…but NOW I DO! :-D | NaN | 2015-02-24 11:11:19 -0800 | San Diego | Pacific Time (US & Canada) |
| 9 | 570295459631263746 | positive | 1.0000 | NaN | NaN | Virgin America | NaN | YupitsTate | NaN | 0 | @VirginAmerica it was amazing, and arrived an hour early. You're too good to me. | NaN | 2015-02-24 10:53:27 -0800 | Los Angeles | Eastern Time (US & Canada) |
| 10 | 570294189143031808 | neutral | 0.6769 | NaN | 0.0000 | Virgin America | NaN | idk_but_youtube | NaN | 0 | @VirginAmerica did you know that suicide is the second leading cause of death among teens 10-24 | NaN | 2015-02-24 10:48:24 -0800 | 1/1 loner squad | Eastern Time (US & Canada) |
| 11 | 570289724453216256 | positive | 1.0000 | NaN | NaN | Virgin America | NaN | HyperCamiLax | NaN | 0 | @VirginAmerica I <3 pretty graphics. so much better than minimal iconography. :D | NaN | 2015-02-24 10:30:40 -0800 | NYC | America/New_York |
| 12 | 570289584061480960 | positive | 1.0000 | NaN | NaN | Virgin America | NaN | HyperCamiLax | NaN | 0 | @VirginAmerica This is such a great deal! Already thinking about my 2nd trip to @Australia & I haven't even gone on my 1st trip yet! ;p | NaN | 2015-02-24 10:30:06 -0800 | NYC | America/New_York |
| 13 | 570287408438120448 | positive | 0.6451 | NaN | NaN | Virgin America | NaN | mollanderson | NaN | 0 | @VirginAmerica @virginmedia I'm flying your #fabulous #Seductive skies again! U take all the #stress away from travel http://t.co/ahlXHhKiyn | NaN | 2015-02-24 10:21:28 -0800 | NaN | Eastern Time (US & Canada) |
| 14 | 570285904809598977 | positive | 1.0000 | NaN | NaN | Virgin America | NaN | sjespers | NaN | 0 | @VirginAmerica Thanks! | NaN | 2015-02-24 10:15:29 -0800 | San Francisco, CA | Pacific Time (US & Canada) |
| 15 | 570282469121007616 | negative | 0.6842 | Late Flight | 0.3684 | Virgin America | NaN | smartwatermelon | NaN | 0 | @VirginAmerica SFO-PDX schedule is still MIA. | NaN | 2015-02-24 10:01:50 -0800 | palo alto, ca | Pacific Time (US & Canada) |
| 16 | 570277724385734656 | positive | 1.0000 | NaN | NaN | Virgin America | NaN | ItzBrianHunty | NaN | 0 | @VirginAmerica So excited for my first cross country flight LAX to MCO I've heard nothing but great things about Virgin America. #29DaysToGo | NaN | 2015-02-24 09:42:59 -0800 | west covina | Pacific Time (US & Canada) |
| 17 | 570276917301137409 | negative | 1.0000 | Bad Flight | 1.0000 | Virgin America | NaN | heatherovieda | NaN | 0 | @VirginAmerica I flew from NYC to SFO last week and couldn't fully sit in my seat due to two large gentleman on either side of me. HELP! | NaN | 2015-02-24 09:39:46 -0800 | this place called NYC | Eastern Time (US & Canada) |
| 18 | 570270684619923457 | positive | 1.0000 | NaN | NaN | Virgin America | NaN | thebrandiray | NaN | 0 | I ❤️ flying @VirginAmerica. ☺️👍 | NaN | 2015-02-24 09:15:00 -0800 | Somewhere celebrating life. | Atlantic Time (Canada) |
| 19 | 570267956648792064 | positive | 1.0000 | NaN | NaN | Virgin America | NaN | JNLpierce | NaN | 0 | @VirginAmerica you know what would be amazingly awesome? BOS-FLL PLEASE!!!!!!! I want to fly with only you. | NaN | 2015-02-24 09:04:10 -0800 | Boston | Waltham | Quito |
| 20 | 570265883513384960 | negative | 0.6705 | Can't Tell | 0.3614 | Virgin America | NaN | MISSGJ | NaN | 0 | @VirginAmerica why are your first fares in May over three times more than other carriers when all seats are available to select??? | NaN | 2015-02-24 08:55:56 -0800 | NaN | NaN |
| 21 | 570264145116819457 | positive | 1.0000 | NaN | NaN | Virgin America | NaN | DT_Les | NaN | 0 | @VirginAmerica I love this graphic. http://t.co/UT5GrRwAaA | [40.74804263, -73.99295302] | 2015-02-24 08:49:01 -0800 | NaN | NaN |
| 22 | 570259420287868928 | positive | 1.0000 | NaN | NaN | Virgin America | NaN | ElvinaBeck | NaN | 0 | @VirginAmerica I love the hipster innovation. You are a feel good brand. | NaN | 2015-02-24 08:30:15 -0800 | Los Angeles | Pacific Time (US & Canada) |
| 23 | 570258822297579520 | neutral | 1.0000 | NaN | NaN | Virgin America | NaN | rjlynch21086 | NaN | 0 | @VirginAmerica will you be making BOS>LAS non stop permanently anytime soon? | NaN | 2015-02-24 08:27:52 -0800 | Boston, MA | Eastern Time (US & Canada) |
| 24 | 570256553502068736 | negative | 1.0000 | Customer Service Issue | 0.3557 | Virgin America | NaN | ayeevickiee | NaN | 0 | @VirginAmerica you guys messed up my seating.. I reserved seating with my friends and you guys gave my seat away ... 😡 I want free internet | NaN | 2015-02-24 08:18:51 -0800 | 714 | Mountain Time (US & Canada) |
# For sentiment analysis we only need the two columns that contain the tweet text and the sentiment label.
df_Tweets_TF[['text', 'airline_sentiment']].head(25)
| | text | airline_sentiment |
|---|---|---|
| 0 | @VirginAmerica What @dhepburn said. | neutral |
| 1 | @VirginAmerica plus you've added commercials to the experience... tacky. | positive |
| 2 | @VirginAmerica I didn't today... Must mean I need to take another trip! | neutral |
| 3 | @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces & they have little recourse | negative |
| 4 | @VirginAmerica and it's a really big bad thing about it | negative |
| 5 | @VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\nit's really the only bad thing about flying VA | negative |
| 6 | @VirginAmerica yes, nearly every time I fly VX this “ear worm” won’t go away :) | positive |
| 7 | @VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP | neutral |
| 8 | @virginamerica Well, I didn't…but NOW I DO! :-D | positive |
| 9 | @VirginAmerica it was amazing, and arrived an hour early. You're too good to me. | positive |
| 10 | @VirginAmerica did you know that suicide is the second leading cause of death among teens 10-24 | neutral |
| 11 | @VirginAmerica I <3 pretty graphics. so much better than minimal iconography. :D | positive |
| 12 | @VirginAmerica This is such a great deal! Already thinking about my 2nd trip to @Australia & I haven't even gone on my 1st trip yet! ;p | positive |
| 13 | @VirginAmerica @virginmedia I'm flying your #fabulous #Seductive skies again! U take all the #stress away from travel http://t.co/ahlXHhKiyn | positive |
| 14 | @VirginAmerica Thanks! | positive |
| 15 | @VirginAmerica SFO-PDX schedule is still MIA. | negative |
| 16 | @VirginAmerica So excited for my first cross country flight LAX to MCO I've heard nothing but great things about Virgin America. #29DaysToGo | positive |
| 17 | @VirginAmerica I flew from NYC to SFO last week and couldn't fully sit in my seat due to two large gentleman on either side of me. HELP! | negative |
| 18 | I ❤️ flying @VirginAmerica. ☺️👍 | positive |
| 19 | @VirginAmerica you know what would be amazingly awesome? BOS-FLL PLEASE!!!!!!! I want to fly with only you. | positive |
| 20 | @VirginAmerica why are your first fares in May over three times more than other carriers when all seats are available to select??? | negative |
| 21 | @VirginAmerica I love this graphic. http://t.co/UT5GrRwAaA | positive |
| 22 | @VirginAmerica I love the hipster innovation. You are a feel good brand. | positive |
| 23 | @VirginAmerica will you be making BOS>LAS non stop permanently anytime soon? | neutral |
| 24 | @VirginAmerica you guys messed up my seating.. I reserved seating with my friends and you guys gave my seat away ... 😡 I want free internet | negative |
# Count the numbers of each type of tweet.
df_Tweets_TF['airline_sentiment'].value_counts()
negative 9178 neutral 3099 positive 2363 Name: airline_sentiment, dtype: int64
# Convert the airline_sentiment attribute into integer format.
# Tensorflow requires the labels as numbers.
df_Tweets_TF['airline_sentiment'] = df_Tweets_TF['airline_sentiment'].replace(
    {'negative': 0, 'neutral': 1, 'positive': 2})
# Split the dataset into text (the independent variable) and label (the dependent variable).
X = df_Tweets_TF['text'] # data
y = df_Tweets_TF['airline_sentiment'] # labels
# Import the relevant Tensorflow libraries.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from tensorflow.keras.utils import to_categorical
# Convert the training data into tensors to feed into the neural network.
# Create the tokenizer class object.
t = Tokenizer()
t.fit_on_texts(X)
# How many unique words are present in the Twitter US Airline dataset?
vocab_size = len(t.word_index) + 1
# Texts are encoded as sequences of integer indices so the model can learn from them.
sequences = t.texts_to_sequences(X)
# Create a function that finds the length (in tokens) of the longest tweet in the dataset.
def max_tweet():
    max_length = len(sequences[0])
    for i in range(1, len(sequences)):
        if len(sequences[i]) > max_length:
            max_length = len(sequences[i])
    return max_length
tweet_num = max_tweet()
tweet_num
30
Observation:
The longest tweet has 30 words.
# Each tweet has a different number of words, so the shorter sequences are padded with 0's.
# https://realpython.com/python-keras-text-classification/
from tensorflow.keras.preprocessing.sequence import pad_sequences
maxlen = tweet_num
padded_X = pad_sequences(sequences, padding='post', maxlen=maxlen)
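`pad_sequences` with `padding='post'` appends zeros after each sequence, and (by Keras's default `truncating='pre'`) drops tokens from the front of any sequence longer than `maxlen`. A pure-NumPy sketch of that behavior:

```python
import numpy as np

def pad_post(seqs, maxlen):
    """Mimic Keras pad_sequences(padding='post', truncating='pre')."""
    out = np.zeros((len(seqs), maxlen), dtype=int)
    for i, seq in enumerate(seqs):
        kept = seq[-maxlen:]           # truncate from the front if too long
        out[i, :len(kept)] = kept      # zeros fill the tail
    return out

print(pad_post([[1, 2], [3, 4, 5, 6]], maxlen=3))
# → [[1 2 0]
#    [4 5 6]]
```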
# Convert the labels to a categorical numpy array.
labels = to_categorical(np.asarray(y))
# Split the dataset into train and test.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(padded_X, labels, test_size = 0.2, random_state = 0)
# Display the sizes of the train and test datasets.
print('X_train size:', X_train.shape)
print('y_train size:', y_train.shape)
print('X_test size:', X_test.shape)
print('y_test size:', y_test.shape)
X_train size: (11712, 30) y_train size: (11712, 3) X_test size: (2928, 30) y_test size: (2928, 3)
Word Embedding.
In NLP, textual data must be represented in a form computers can work with. We will focus on word embeddings, a representation of text in which similar words have similar vector representations. One embedding model is word2vec, which takes a large corpus of text and outputs a vector space where each unique word has its own corresponding vector; in this space, words with similar meanings are located close to one another.
Another popular model is Global Vectors for Word Representation (GloVe), which extends word2vec. It generally produces better word embeddings by building a word-context co-occurrence matrix: a measure of how likely certain words are to appear in the context of others. For example, the word "chip" is likely to be seen in the context of "potato" but not of "cloud". Its developers trained the published embeddings on English text from Wikipedia and Common Crawl.
I will use a pre-trained word embedding because GloVe should generalize well to this dataset. The embedding space created by GloVe likely covers most of the words we will encounter in our tweets, so we can use these vector representations instead of training our own from a much more limited vocabulary.
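"Close to one another" in an embedding space is usually measured with cosine similarity. A self-contained sketch with made-up 3-dimensional vectors (real GloVe vectors are 50- to 300-dimensional):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# made-up 3-d "embeddings", chosen only to illustrate the geometry
emb = {
    "chip":   np.array([0.9, 0.1, 0.0]),
    "potato": np.array([0.8, 0.2, 0.1]),
    "cloud":  np.array([0.0, 0.9, 0.4]),
}

# "chip" sits closer to "potato" than to "cloud" in this toy space
print(cosine(emb["chip"], emb["potato"]) > cosine(emb["chip"], emb["cloud"]))  # → True
```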
# Load the whole embedding into memory.
# GloVe is an unsupervised learning algorithm for obtaining vector representations for words.
# Training is performed on aggregated global word-word co-occurrence statistics from a corpus,
# and the resulting representations showcase interesting linear substructures of the word vector space.
# https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
# 100-dimensional version (embedding dimension).
embeddings_index = dict()
# The text file was downloaded from the Stanford NLP website (https://nlp.stanford.edu/projects/glove/).
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))
Loaded 261962 word vectors.
# GloVe requires the creation of a word embedding/word-context matrix.
# Create a matrix of weights for the words in the training set:
# one embedding per unique word, looked up from the Tokenizer index,
# with the corresponding weight vector taken from the GloVe embedding.
# The embedding matrix has shape: number of unique words x embedding dim (100).
embedding_matrix = np.zeros((vocab_size, 100))
# Fill in the matrix.
for word, i in t.word_index.items(): # Iterate over the tokenizer's word-index dictionary.
    embedding_vector = embeddings_index.get(word) # Retrieves the GloVe vector for the word, if present.
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector # Each row of the matrix holds one word's vector.
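Words absent from GloVe keep their all-zero row. A tiny self-contained version of the loop above, with a hypothetical 4-dimensional "GloVe" dictionary and tokenizer index, makes that visible:

```python
import numpy as np

# hypothetical tokenizer word index (index 0 is reserved for padding)
word_index = {"flight": 1, "late": 2, "zzzmisspelled": 3}
# hypothetical pre-trained vectors; "zzzmisspelled" is out-of-vocabulary
embeddings_index = {"flight": np.ones(4), "late": np.full(4, 2.0)}

vocab_size = len(word_index) + 1
embedding_matrix = np.zeros((vocab_size, 4))
for word, i in word_index.items():
    vec = embeddings_index.get(word)
    if vec is not None:
        embedding_matrix[i] = vec      # known word: copy its pre-trained vector

print(embedding_matrix[3])  # → [0. 0. 0. 0.]  (the OOV word's row stays zero)
```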
# Create embedding layer using embedding matrix.
from tensorflow.keras.layers import Embedding
# Input is vocab_size, output is 100.
# Weights from embedding matrix, set trainable = False.
embedding_layer = Embedding(input_dim=vocab_size, output_dim=100, weights=[embedding_matrix],
input_length = tweet_num, trainable=False)
# Import the Tensorflow libraries.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import GRU
from tensorflow.keras.layers import BatchNormalization
Model 1: A simple LSTM model with regularization and increased dimensionality.
Long Short-Term Memory (LSTM):
Simple Recurrent Neural Networks (RNNs) suffer from the vanishing gradient problem, in which information from earlier time steps fades as the network becomes deeper. The LSTM architecture was created to avoid this problem by allowing the neural network to carry information across multiple time steps. This means it can store important information for later use, preventing gradients from vanishing along the way. Additionally, an LSTM cell can decide what information to discard, so it learns to recognize an important input, store it for the future, and remove what is unnecessary.
# Set up the LSTM model.
lstm_model1 = Sequential()
lstm_model1.add(embedding_layer)
lstm_model1.add(LSTM(256,
dropout = 0.2,
recurrent_dropout = 0.5))
lstm_model1.add(Dense(3, activation='softmax'))
lstm_model1.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
lstm_model1.summary()
Model: "sequential_1" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding (Embedding) (None, 30, 100) 1576900 _________________________________________________________________ lstm (LSTM) (None, 256) 365568 _________________________________________________________________ dense_4 (Dense) (None, 3) 771 ================================================================= Total params: 1,943,239 Trainable params: 366,339 Non-trainable params: 1,576,900 _________________________________________________________________
# Set the training hyperparameters (100 epochs, batch size 256, 20% validation split) and fit the LSTM model.
hist_1 = lstm_model1.fit(X_train, y_train,
validation_split = 0.2,
epochs=100, batch_size=256)
Epoch 1/100 37/37 [==============================] - 27s 667ms/step - loss: 0.8479 - acc: 0.6242 - val_loss: 0.7456 - val_acc: 0.7004
Epoch 2/100 37/37 [==============================] - 24s 658ms/step - loss: 0.6927 - acc: 0.7132 - val_loss: 0.6307 - val_acc: 0.7499
... (epochs 3-99: training accuracy climbs steadily toward 0.98 while validation accuracy levels off near 0.79 and validation loss rises) ...
Epoch 100/100 37/37 [==============================] - 24s 658ms/step - loss: 0.0662 - acc: 0.9781 - val_loss: 1.0905 - val_acc: 0.7887
# Train and test accuracy.
loss, accuracy = lstm_model1.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = lstm_model1.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy: {:.4f}".format(accuracy))
Training Accuracy: 0.9532
Testing Accuracy: 0.7807
Observation:
The model overfits slightly: about 95% accuracy on the training set versus about 78% on the test set.
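A common response to this overfitting is to stop training once validation loss stops improving. The following is a minimal, pure-Python sketch of the patience logic behind early stopping (the `val_losses` values are illustrative, not taken from this run); in Keras the same behavior is available via the `EarlyStopping` callback.

```python
def early_stop_epoch(val_losses, patience=5):
    """Return the (0-based) epoch at which patience-based early stopping
    would halt: the first epoch after which validation loss has failed
    to improve for `patience` consecutive epochs."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1  # patience never exhausted

# Illustrative curve: improves early, then plateaus and worsens.
curve = [0.80, 0.70, 0.62, 0.60, 0.61, 0.63, 0.62, 0.64, 0.65, 0.66]
print(early_stop_epoch(curve, patience=5))  # -> 8
```

In Keras this corresponds to passing `callbacks=[EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)]` to `model.fit`.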
# Plot train/test loss and accuracy.
acc = hist_1.history['acc']
val_acc = hist_1.history['val_acc']
loss = hist_1.history['loss']
val_loss = hist_1.history['val_loss']
epochs = range(len(acc))
plt.plot(epochs, acc, 'g', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'g', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
Observation:
Training accuracy climbs steadily into overfitting territory, while validation accuracy peaks at about 0.80.
Training loss decreases steadily, but validation loss dips briefly and then rises to a level that clearly signals overfitting.
# Confusion matrix.
from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import unique_labels
# Get predicted values
y_pred = lstm_model1.predict(X_test)  # outputs probabilities of each sentiment
# Create empty numpy array to match length of test observations
y_pred_array = np.zeros(X_test.shape[0])
# Find class with highest probability
for i in range(0, y_pred.shape[0]):
    label_predict = np.argmax(y_pred[i])  # column with max probability
    y_pred_array[i] = label_predict
# Convert to integers
y_pred_array = y_pred_array.astype(int)
# Convert y_test to a 1d numpy array
y_test_array = np.zeros(X_test.shape[0])
# Find the class whose one-hot entry is 1
for i in range(0, y_test.shape[0]):
    label_predict = np.argmax(y_test[i])
    y_test_array[i] = label_predict
y_test_array = y_test_array.astype(int)
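The element-wise loops above can equivalently be written as a single vectorized `np.argmax` call over the class axis; a self-contained sketch with dummy probability arrays:

```python
import numpy as np

# Dummy softmax-style outputs for 4 observations, 3 classes.
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.2, 0.3, 0.5],
                   [0.6, 0.3, 0.1]])

# Loop version (as in the cell above).
loop_labels = np.zeros(y_pred.shape[0], dtype=int)
for i in range(y_pred.shape[0]):
    loop_labels[i] = np.argmax(y_pred[i])

# Vectorized version: argmax along the class axis.
vec_labels = np.argmax(y_pred, axis=1)

print(np.array_equal(loop_labels, vec_labels))  # -> True
print(vec_labels.tolist())  # -> [0, 1, 2, 0]
```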
class_names = np.array(['Negative', 'Neutral', 'Positive'])
# Create the function to plot the confusion matrix.
def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'
    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    classes = classes[unique_labels(y_true, y_pred)]
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')
    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")
    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax
np.set_printoptions(precision=2)
# Plot the non-normalized confusion matrix.
plot_confusion_matrix(y_test_array, y_pred_array, classes=class_names,
                      title='Confusion matrix, without normalization')
# Plot the normalized confusion matrix.
plot_confusion_matrix(y_test_array, y_pred_array, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')
plt.show()
Confusion matrix, without normalization
[[1600  178   92]
 [ 186  362   66]
 [  70   50  324]]
Normalized confusion matrix
[[0.86 0.1  0.05]
 [0.3  0.59 0.11]
 [0.16 0.11 0.73]]
Observations:
As the confusion matrices above show, Model 1 did an excellent job predicting the negative label when a tweet was negative but struggled more with the positive and neutral labels. This is likely because the training set is largely composed of negative tweets, so the class imbalance taught the model to assign a higher probability to the negative label.
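One common response to such an imbalance (not applied in this notebook) is to weight each class inversely to its frequency during training; under the widely used "balanced" heuristic, the weight for class c is n_samples / (n_classes * count_c). A sketch with illustrative counts (not the exact counts from this dataset):

```python
import numpy as np

# Illustrative class counts: (negative, neutral, positive).
counts = np.array([9000, 3000, 2400])
n_classes = len(counts)
n_samples = counts.sum()

# 'balanced' heuristic: rarer classes receive larger weights.
weights = n_samples / (n_classes * counts)
class_weight = {i: round(float(w), 3) for i, w in enumerate(weights)}
print(class_weight)  # -> {0: 0.533, 1: 1.6, 2: 2.0}
```

A dictionary like this can be passed to Keras via the `class_weight` argument of `model.fit`, which scales each sample's contribution to the loss by its class weight.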
Model 2: LSTM with regularization and reduced dimensionality.
lstm_model2 = Sequential()
lstm_model2.add(embedding_layer)
lstm_model2.add(LSTM(64,
                     dropout=0.2,
                     recurrent_dropout=0.5))
lstm_model2.add(Dense(3, activation='softmax'))
lstm_model2.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
lstm_model2.summary()
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 30, 100)           1576900
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                42240
_________________________________________________________________
dense_5 (Dense)              (None, 3)                 195
=================================================================
Total params: 1,619,335
Trainable params: 42,435
Non-trainable params: 1,576,900
_________________________________________________________________
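The 42,240 LSTM parameters in the summary follow from the standard LSTM gate arithmetic: four gates (input, forget, cell, output), each with an input kernel, a recurrent kernel, and a bias, i.e. 4 * (input_dim * units + units * units + units). A quick check against this summary and the Model 3 summary below:

```python
def lstm_param_count(input_dim, units):
    # 4 gates, each with: kernel (input_dim x units),
    # recurrent kernel (units x units), and bias (units).
    return 4 * (input_dim * units + units * units + units)

print(lstm_param_count(100, 64))    # Model 2 LSTM: -> 42240
print(lstm_param_count(100, 256))   # Model 3 first LSTM: -> 365568
print(lstm_param_count(256, 128))   # Model 3 second LSTM: -> 197120
```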
hist_2 = lstm_model2.fit(X_train, y_train,
                         validation_split=0.2,
                         epochs=100, batch_size=256)
Epoch 1/100
37/37 [==============================] - 9s 162ms/step - loss: 0.8839 - acc: 0.6123 - val_loss: 0.8460 - val_acc: 0.6176
Epoch 2/100
37/37 [==============================] - 6s 151ms/step - loss: 0.8258 - acc: 0.6403 - val_loss: 0.7855 - val_acc: 0.6718
[... epochs 3-99 omitted: training loss fell steadily while val_acc plateaued around 0.80-0.81 ...]
Epoch 100/100
37/37 [==============================] - 6s 155ms/step - loss: 0.2669 - acc: 0.8967 - val_loss: 0.6277 - val_acc: 0.8118
# Train and test accuracy
loss, accuracy = lstm_model2.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = lstm_model2.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy: {:.4f}".format(accuracy))
Training Accuracy: 0.9170
Testing Accuracy: 0.7879
# Plot train/test loss and accuracy
acc = hist_2.history['acc']
val_acc = hist_2.history['val_acc']
loss = hist_2.history['loss']
val_loss = hist_2.history['val_loss']
epochs = range(len(acc))
plt.plot(epochs, acc, 'g', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'g', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
Observation:
Training accuracy again climbs into overfitting territory, while validation accuracy peaks at about 0.80-0.81.
Training loss decreases steadily, but validation loss dips early and then rises to roughly 0.6-0.7, a milder increase than Model 1's.
# Get the predicted values.
y_pred = lstm_model2.predict(X_test)  # outputs probabilities of each sentiment
# Create empty numpy array to match length of test observations
y_pred_array = np.zeros(X_test.shape[0])
# Find the class with highest probability.
for i in range(0, y_pred.shape[0]):
    label_predict = np.argmax(y_pred[i])  # column with max probability
    y_pred_array[i] = label_predict
# Convert to integers
y_pred_array = y_pred_array.astype(int)
np.set_printoptions(precision=2)
# Plot the non-normalized confusion matrix.
plot_confusion_matrix(y_test_array, y_pred_array, classes=class_names,
                      title='Confusion matrix, without normalization')
# Plot the normalized confusion matrix.
plot_confusion_matrix(y_test_array, y_pred_array, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')
plt.show()
Confusion matrix, without normalization
[[1643  147   80]
 [ 195  349   70]
 [  74   55  315]]
Normalized confusion matrix
[[0.88 0.08 0.04]
 [0.32 0.57 0.11]
 [0.17 0.12 0.71]]
Observations:
As the confusion matrices above show, Model 2 also predicted the negative label very well but did worse on the positive and neutral labels, again most likely a consequence of the class imbalance toward negative tweets in training. Compared with Model 1, Model 2 is slightly better on negative recall (0.88 vs. 0.86) but slightly worse on neutral (0.57 vs. 0.59) and positive (0.71 vs. 0.73) recall; its precision on the neutral and positive classes improves slightly, since it makes fewer false neutral/positive predictions.
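That comparison can be read directly off the two unnormalized matrices: per-class recall is the diagonal divided by the corresponding row sum. A sketch using the matrices printed above:

```python
import numpy as np

cm_model1 = np.array([[1600, 178,  92],
                      [ 186, 362,  66],
                      [  70,  50, 324]])
cm_model2 = np.array([[1643, 147,  80],
                      [ 195, 349,  70],
                      [  74,  55, 315]])

def per_class_recall(cm):
    # Recall for class i = correct predictions / true instances of i.
    return cm.diagonal() / cm.sum(axis=1)

for name, cm in [("Model 1", cm_model1), ("Model 2", cm_model2)]:
    print(name, np.round(per_class_recall(cm), 2))
# Model 1 [0.86 0.59 0.73]
# Model 2 [0.88 0.57 0.71]
```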
Model 3: LSTM Layer Stacking.
# LSTM Model.
lstm_model3 = Sequential()
lstm_model3.add(embedding_layer)
lstm_model3.add(LSTM(256,
                     dropout=0.2,
                     recurrent_dropout=0.5,
                     return_sequences=True))
lstm_model3.add(LSTM(128,
                     dropout=0.2,
                     recurrent_dropout=0.5))
lstm_model3.add(Dense(3, activation='softmax'))
lstm_model3.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
lstm_model3.summary()
Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 30, 100)           1576900
_________________________________________________________________
lstm_2 (LSTM)                (None, 30, 256)           365568
_________________________________________________________________
lstm_3 (LSTM)                (None, 128)               197120
_________________________________________________________________
dense_6 (Dense)              (None, 3)                 387
=================================================================
Total params: 2,139,975
Trainable params: 563,075
Non-trainable params: 1,576,900
_________________________________________________________________
# Neural network parameters.
history_3 = lstm_model3.fit(X_train, y_train,
                            validation_split=0.2,
                            epochs=100, batch_size=256)
Epoch 1/100
37/37 [==============================] - 48s 1s/step - loss: 0.8296 - acc: 0.6373 - val_loss: 0.7543 - val_acc: 0.6961
[... epochs 2-74 omitted: training accuracy climbed past 0.97 while val_acc stayed near 0.78-0.80 ...]
Epoch 75/100
37/37 [==============================] - 41s 1s/step - loss: 0.0776 - acc: 0.9722 - val_loss: 1.0214 - val_acc: 0.7870
[log truncated after epoch 75]
[==============================] - 41s 1s/step - loss: 0.0844 - acc: 0.9721 - val_loss: 1.0064 - val_acc: 0.7926 Epoch 77/100 37/37 [==============================] - 41s 1s/step - loss: 0.0840 - acc: 0.9688 - val_loss: 0.9072 - val_acc: 0.7887 Epoch 78/100 37/37 [==============================] - 41s 1s/step - loss: 0.0776 - acc: 0.9714 - val_loss: 0.9859 - val_acc: 0.7930 Epoch 79/100 37/37 [==============================] - 41s 1s/step - loss: 0.0677 - acc: 0.9768 - val_loss: 1.0185 - val_acc: 0.7990 Epoch 80/100 37/37 [==============================] - 41s 1s/step - loss: 0.0674 - acc: 0.9768 - val_loss: 1.0220 - val_acc: 0.7926 Epoch 81/100 37/37 [==============================] - 41s 1s/step - loss: 0.0843 - acc: 0.9702 - val_loss: 0.9675 - val_acc: 0.7985 Epoch 82/100 37/37 [==============================] - 41s 1s/step - loss: 0.0646 - acc: 0.9776 - val_loss: 1.0475 - val_acc: 0.7892 Epoch 83/100 37/37 [==============================] - 42s 1s/step - loss: 0.0723 - acc: 0.9771 - val_loss: 1.0319 - val_acc: 0.7909 Epoch 84/100 37/37 [==============================] - 42s 1s/step - loss: 0.0728 - acc: 0.9753 - val_loss: 0.9993 - val_acc: 0.7973 Epoch 85/100 37/37 [==============================] - 42s 1s/step - loss: 0.0713 - acc: 0.9745 - val_loss: 1.0329 - val_acc: 0.7674 Epoch 86/100 37/37 [==============================] - 42s 1s/step - loss: 0.0685 - acc: 0.9752 - val_loss: 1.0865 - val_acc: 0.7866 Epoch 87/100 37/37 [==============================] - 42s 1s/step - loss: 0.0813 - acc: 0.9704 - val_loss: 0.9698 - val_acc: 0.7939 Epoch 88/100 37/37 [==============================] - 42s 1s/step - loss: 0.0649 - acc: 0.9789 - val_loss: 1.0628 - val_acc: 0.7943 Epoch 89/100 37/37 [==============================] - 42s 1s/step - loss: 0.0643 - acc: 0.9771 - val_loss: 1.0890 - val_acc: 0.7926 Epoch 90/100 37/37 [==============================] - 42s 1s/step - loss: 0.0665 - acc: 0.9788 - val_loss: 1.0384 - val_acc: 0.7917 Epoch 91/100 37/37 
[==============================] - 42s 1s/step - loss: 0.0596 - acc: 0.9791 - val_loss: 1.1166 - val_acc: 0.7994 Epoch 92/100 37/37 [==============================] - 42s 1s/step - loss: 0.0617 - acc: 0.9788 - val_loss: 1.0272 - val_acc: 0.7853 Epoch 93/100 37/37 [==============================] - 42s 1s/step - loss: 0.0632 - acc: 0.9773 - val_loss: 1.0148 - val_acc: 0.7926 Epoch 94/100 37/37 [==============================] - 42s 1s/step - loss: 0.0593 - acc: 0.9781 - val_loss: 1.0725 - val_acc: 0.7862 Epoch 95/100 37/37 [==============================] - 41s 1s/step - loss: 0.0557 - acc: 0.9789 - val_loss: 1.0916 - val_acc: 0.7939 Epoch 96/100 37/37 [==============================] - 42s 1s/step - loss: 0.0614 - acc: 0.9791 - val_loss: 1.0473 - val_acc: 0.7973 Epoch 97/100 37/37 [==============================] - 42s 1s/step - loss: 0.0702 - acc: 0.9755 - val_loss: 1.0116 - val_acc: 0.7985 Epoch 98/100 37/37 [==============================] - 42s 1s/step - loss: 0.0561 - acc: 0.9814 - val_loss: 1.0888 - val_acc: 0.7892 Epoch 99/100 37/37 [==============================] - 41s 1s/step - loss: 0.0589 - acc: 0.9781 - val_loss: 1.0623 - val_acc: 0.7934 Epoch 100/100 37/37 [==============================] - 41s 1s/step - loss: 0.0539 - acc: 0.9816 - val_loss: 1.0261 - val_acc: 0.7939
# Find train and test accuracy.
loss, accuracy = lstm_model3.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = lstm_model3.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy: {:.4f}".format(accuracy))
Training Accuracy: 0.9548 Testing Accuracy: 0.7862
# Plot train/test loss and accuracy.
acc = history_3.history['acc']
val_acc = history_3.history['val_acc']
loss = history_3.history['loss']
val_loss = history_3.history['val_loss']
epochs = range(len(acc))
plt.plot(epochs, acc, 'g', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'g', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
Observation:
Training accuracy climbs steadily, to the point of overfitting, while validation accuracy plateaus at about 0.80.
Training loss decreases steadily, but validation loss dips briefly and then climbs above 1.0, a clear sign of overfitting.
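Since validation loss bottoms out early and then climbs for the rest of the run, halting training once the loss stops improving would likely help. Below is a minimal standalone sketch of the patience rule; Keras provides this behavior as the `EarlyStopping` callback, and `best_stopping_epoch` here is a hypothetical helper for illustration only, not part of the notebook.

```python
def best_stopping_epoch(val_losses, patience=5):
    """Return (stopping epoch, best epoch), both 0-indexed, if training halts
    after `patience` epochs without improvement in validation loss."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss  # New best; reset the clock.
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch  # Stop here; keep weights from best_epoch.
    return len(val_losses) - 1, best_epoch

# Toy loss trace shaped like the curves above: a dip, then a steady climb.
stop, best = best_stopping_epoch([0.60, 0.52, 0.50, 0.53, 0.55, 0.58, 0.61, 0.65])
```

In the notebook this would amount to passing `callbacks=[EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)]` to `fit`.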
# Get the predicted values.
y_pred = lstm_model3.predict(X_test) # Outputs probabilities of each sentiment.
# Create an empty numpy array to match the number of test observations.
y_pred_array = np.zeros(X_test.shape[0])
# Find the class with the highest probability.
for i in range(0, y_pred.shape[0]):
    label_predict = np.argmax(y_pred[i]) # Column with max probability.
    y_pred_array[i] = label_predict
# Convert to integers.
y_pred_array = y_pred_array.astype(int)
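As an aside, the per-row loop above can be collapsed into a single vectorized call: `np.argmax` with `axis=1` returns the index of the highest-probability class for every row at once. A quick sketch with stand-in probabilities:

```python
import numpy as np

# Fake softmax output for three tweets over three classes, a stand-in for
# the probabilities returned by lstm_model3.predict(X_test).
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.6, 0.3],
                   [0.2, 0.3, 0.5]])

# One call replaces the loop: index of the max probability in each row.
y_pred_array = np.argmax(y_pred, axis=1).astype(int)
print(y_pred_array)  # [0 1 2]
```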
np.set_printoptions(precision=2)
# Plot the non-normalized confusion matrix.
plot_confusion_matrix(y_test_array, y_pred_array, classes=class_names,
                      title='Confusion matrix, without normalization')
# Plot the normalized confusion matrix.
plot_confusion_matrix(y_test_array, y_pred_array, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')
plt.show()
Confusion matrix, without normalization [[1650 142 78] [ 218 323 73] [ 75 40 329]] Normalized confusion matrix [[0.88 0.08 0.04] [0.36 0.53 0.12] [0.17 0.09 0.74]]
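Each row of the normalized matrix is simply that class's recall. A quick NumPy check against the raw counts above reproduces the values (row order follows `class_names`; per the observations below, the dominant first row is the negative class):

```python
import numpy as np

# Raw confusion matrix for Model 3, copied from the output above
# (rows: true class, columns: predicted class).
cm = np.array([[1650, 142,  78],
               [ 218, 323,  73],
               [  75,  40, 329]])

# Recall per class = correct predictions / total true instances of that class.
recall = cm.diagonal() / cm.sum(axis=1)
print(np.round(recall, 2))  # [0.88 0.53 0.74]
```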
Observations:
As the confusion matrices above show, Model 3 did an excellent job of predicting the negative label when a tweet was actually negative, but was noticeably weaker on the positive and neutral labels. This is likely because the training set was largely composed of negative tweets, so the class imbalance taught the model to assign a higher probability to the negative label. Model 3 outperformed Model 2 on the positive and neutral labels and performed about as well as Model 2 on the negative labels.
Model 4: GRU Layer Stacking.
# GRU Model:
gru_model_4 = Sequential()
gru_model_4.add(embedding_layer)
gru_model_4.add(GRU(256,
                    dropout=0.2,
                    recurrent_dropout=0.5,
                    return_sequences=True))
gru_model_4.add(GRU(128,
                    dropout=0.2,
                    recurrent_dropout=0.5))
gru_model_4.add(Dense(3, activation='softmax'))
gru_model_4.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
gru_model_4.summary()
Model: "sequential_4" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding (Embedding) (None, 30, 100) 1576900 _________________________________________________________________ gru (GRU) (None, 30, 256) 274944 _________________________________________________________________ gru_1 (GRU) (None, 128) 148224 _________________________________________________________________ dense_7 (Dense) (None, 3) 387 ================================================================= Total params: 2,000,455 Trainable params: 423,555 Non-trainable params: 1,576,900 _________________________________________________________________
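The GRU parameter counts in the summary can be checked by hand. With TensorFlow 2's default `reset_after=True`, each of the three gates (update, reset, candidate) has an input kernel, a recurrent kernel, and two bias vectors, so a layer holds 3 × (input_dim × units + units² + 2 × units) weights. A quick sketch, assuming that default:

```python
def gru_params(input_dim, units):
    # Input kernel + recurrent kernel + two bias vectors, times three gates,
    # as in TF2's GRU layer with reset_after=True.
    return 3 * (input_dim * units + units * units + 2 * units)

print(gru_params(100, 256))  # 274944, matching gru (None, 30, 256)
print(gru_params(256, 128))  # 148224, matching gru_1 (None, 128)
```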
# Train the model with the chosen hyperparameters.
history_4 = gru_model_4.fit(X_train, y_train,
                            validation_split=0.2,
                            epochs=100, batch_size=256)
Epoch 1/100 37/37 [==============================] - 39s 938ms/step - loss: 0.8825 - acc: 0.6186 - val_loss: 0.8308 - val_acc: 0.6180 Epoch 2/100 37/37 [==============================] - 34s 917ms/step - loss: 0.7510 - acc: 0.6810 - val_loss: 0.7136 - val_acc: 0.6978 Epoch 3/100 37/37 [==============================] - 34s 920ms/step - loss: 0.6900 - acc: 0.7126 - val_loss: 0.6300 - val_acc: 0.7409 Epoch 4/100 37/37 [==============================] - 34s 919ms/step - loss: 0.6172 - acc: 0.7428 - val_loss: 0.5963 - val_acc: 0.7529 Epoch 5/100 37/37 [==============================] - 34s 920ms/step - loss: 0.5923 - acc: 0.7589 - val_loss: 0.5616 - val_acc: 0.7764 Epoch 6/100 37/37 [==============================] - 34s 922ms/step - loss: 0.5717 - acc: 0.7708 - val_loss: 0.5509 - val_acc: 0.7742 Epoch 7/100 37/37 [==============================] - 34s 929ms/step - loss: 0.5580 - acc: 0.7754 - val_loss: 0.5371 - val_acc: 0.7815 Epoch 8/100 37/37 [==============================] - 34s 922ms/step - loss: 0.5368 - acc: 0.7801 - val_loss: 0.5243 - val_acc: 0.7887 Epoch 9/100 37/37 [==============================] - 34s 925ms/step - loss: 0.5390 - acc: 0.7830 - val_loss: 0.5389 - val_acc: 0.7845 Epoch 10/100 37/37 [==============================] - 34s 921ms/step - loss: 0.5181 - acc: 0.7931 - val_loss: 0.5368 - val_acc: 0.7875 Epoch 11/100 37/37 [==============================] - 34s 920ms/step - loss: 0.5041 - acc: 0.7953 - val_loss: 0.5140 - val_acc: 0.7939 Epoch 12/100 37/37 [==============================] - 34s 921ms/step - loss: 0.4941 - acc: 0.8044 - val_loss: 0.4943 - val_acc: 0.8062 Epoch 13/100 37/37 [==============================] - 34s 922ms/step - loss: 0.4761 - acc: 0.8089 - val_loss: 0.5202 - val_acc: 0.7866 Epoch 14/100 37/37 [==============================] - 34s 924ms/step - loss: 0.4766 - acc: 0.8067 - val_loss: 0.5025 - val_acc: 0.8020 Epoch 15/100 37/37 [==============================] - 34s 925ms/step - loss: 0.4634 - acc: 0.8137 - val_loss: 0.5135 - 
val_acc: 0.8101 Epoch 16/100 37/37 [==============================] - 34s 927ms/step - loss: 0.4500 - acc: 0.8192 - val_loss: 0.4992 - val_acc: 0.8114 Epoch 17/100 37/37 [==============================] - 34s 929ms/step - loss: 0.4344 - acc: 0.8333 - val_loss: 0.4992 - val_acc: 0.8131 Epoch 18/100 37/37 [==============================] - 34s 922ms/step - loss: 0.4307 - acc: 0.8264 - val_loss: 0.5142 - val_acc: 0.7994 Epoch 19/100 37/37 [==============================] - 34s 919ms/step - loss: 0.4226 - acc: 0.8341 - val_loss: 0.5091 - val_acc: 0.8079 Epoch 20/100 37/37 [==============================] - 34s 921ms/step - loss: 0.4073 - acc: 0.8417 - val_loss: 0.5121 - val_acc: 0.7977 Epoch 21/100 37/37 [==============================] - 34s 919ms/step - loss: 0.3985 - acc: 0.8437 - val_loss: 0.5020 - val_acc: 0.8114 Epoch 22/100 37/37 [==============================] - 34s 926ms/step - loss: 0.3937 - acc: 0.8485 - val_loss: 0.5153 - val_acc: 0.8024 Epoch 23/100 37/37 [==============================] - 34s 928ms/step - loss: 0.3917 - acc: 0.8486 - val_loss: 0.5115 - val_acc: 0.8067 Epoch 24/100 37/37 [==============================] - 34s 928ms/step - loss: 0.3720 - acc: 0.8552 - val_loss: 0.5039 - val_acc: 0.8122 Epoch 25/100 37/37 [==============================] - 34s 924ms/step - loss: 0.3593 - acc: 0.8620 - val_loss: 0.5254 - val_acc: 0.8165 Epoch 26/100 37/37 [==============================] - 34s 924ms/step - loss: 0.3553 - acc: 0.8607 - val_loss: 0.5121 - val_acc: 0.8143 Epoch 27/100 37/37 [==============================] - 34s 922ms/step - loss: 0.3377 - acc: 0.8742 - val_loss: 0.5182 - val_acc: 0.8092 Epoch 28/100 37/37 [==============================] - 34s 921ms/step - loss: 0.3350 - acc: 0.8694 - val_loss: 0.5282 - val_acc: 0.8143 Epoch 29/100 37/37 [==============================] - 34s 920ms/step - loss: 0.3240 - acc: 0.8706 - val_loss: 0.5255 - val_acc: 0.8139 Epoch 30/100 37/37 [==============================] - 34s 923ms/step - loss: 0.3232 - acc: 
0.8732 - val_loss: 0.5221 - val_acc: 0.8079 Epoch 31/100 37/37 [==============================] - 34s 926ms/step - loss: 0.3114 - acc: 0.8800 - val_loss: 0.5365 - val_acc: 0.8169 Epoch 32/100 37/37 [==============================] - 34s 923ms/step - loss: 0.3085 - acc: 0.8816 - val_loss: 0.5391 - val_acc: 0.8148 Epoch 33/100 37/37 [==============================] - 34s 926ms/step - loss: 0.2920 - acc: 0.8887 - val_loss: 0.6063 - val_acc: 0.8084 Epoch 34/100 37/37 [==============================] - 34s 928ms/step - loss: 0.2946 - acc: 0.8892 - val_loss: 0.5874 - val_acc: 0.8071 Epoch 35/100 37/37 [==============================] - 34s 926ms/step - loss: 0.2786 - acc: 0.8945 - val_loss: 0.5769 - val_acc: 0.8114 Epoch 36/100 37/37 [==============================] - 34s 927ms/step - loss: 0.2632 - acc: 0.9011 - val_loss: 0.5959 - val_acc: 0.8114 Epoch 37/100 37/37 [==============================] - 34s 922ms/step - loss: 0.2509 - acc: 0.9055 - val_loss: 0.6328 - val_acc: 0.8126 Epoch 38/100 37/37 [==============================] - 34s 924ms/step - loss: 0.2584 - acc: 0.9006 - val_loss: 0.5966 - val_acc: 0.8109 Epoch 39/100 37/37 [==============================] - 34s 929ms/step - loss: 0.2506 - acc: 0.9045 - val_loss: 0.5830 - val_acc: 0.8092 Epoch 40/100 37/37 [==============================] - 34s 928ms/step - loss: 0.2358 - acc: 0.9090 - val_loss: 0.5749 - val_acc: 0.8122 Epoch 41/100 37/37 [==============================] - 34s 932ms/step - loss: 0.2348 - acc: 0.9103 - val_loss: 0.5968 - val_acc: 0.8092 Epoch 42/100 37/37 [==============================] - 34s 928ms/step - loss: 0.2325 - acc: 0.9129 - val_loss: 0.5815 - val_acc: 0.8105 Epoch 43/100 37/37 [==============================] - 34s 930ms/step - loss: 0.2195 - acc: 0.9183 - val_loss: 0.6243 - val_acc: 0.8096 Epoch 44/100 37/37 [==============================] - 34s 922ms/step - loss: 0.2007 - acc: 0.9232 - val_loss: 0.6749 - val_acc: 0.8084 Epoch 45/100 37/37 [==============================] - 34s 
924ms/step - loss: 0.1995 - acc: 0.9257 - val_loss: 0.6325 - val_acc: 0.8032 Epoch 46/100 37/37 [==============================] - 34s 920ms/step - loss: 0.1932 - acc: 0.9280 - val_loss: 0.6630 - val_acc: 0.8075 Epoch 47/100 37/37 [==============================] - 34s 922ms/step - loss: 0.1974 - acc: 0.9252 - val_loss: 0.6765 - val_acc: 0.7964 Epoch 48/100 37/37 [==============================] - 34s 922ms/step - loss: 0.1735 - acc: 0.9347 - val_loss: 0.6988 - val_acc: 0.8071 Epoch 49/100 37/37 [==============================] - 34s 922ms/step - loss: 0.1743 - acc: 0.9362 - val_loss: 0.7013 - val_acc: 0.8131 Epoch 50/100 37/37 [==============================] - 34s 925ms/step - loss: 0.1817 - acc: 0.9333 - val_loss: 0.6216 - val_acc: 0.8062 Epoch 51/100 37/37 [==============================] - 34s 921ms/step - loss: 0.1687 - acc: 0.9368 - val_loss: 0.7041 - val_acc: 0.8071 Epoch 52/100 37/37 [==============================] - 34s 925ms/step - loss: 0.1701 - acc: 0.9352 - val_loss: 0.7158 - val_acc: 0.8067 Epoch 53/100 37/37 [==============================] - 34s 926ms/step - loss: 0.1557 - acc: 0.9425 - val_loss: 0.6809 - val_acc: 0.8058 Epoch 54/100 37/37 [==============================] - 34s 926ms/step - loss: 0.1582 - acc: 0.9425 - val_loss: 0.7242 - val_acc: 0.7913 Epoch 55/100 37/37 [==============================] - 34s 928ms/step - loss: 0.1543 - acc: 0.9431 - val_loss: 0.7438 - val_acc: 0.7977 Epoch 56/100 37/37 [==============================] - 34s 928ms/step - loss: 0.1440 - acc: 0.9465 - val_loss: 0.8015 - val_acc: 0.7849 Epoch 57/100 37/37 [==============================] - 34s 924ms/step - loss: 0.1466 - acc: 0.9444 - val_loss: 0.7610 - val_acc: 0.8084 Epoch 58/100 37/37 [==============================] - 34s 922ms/step - loss: 0.1399 - acc: 0.9493 - val_loss: 0.7607 - val_acc: 0.8109 Epoch 59/100 37/37 [==============================] - 34s 929ms/step - loss: 0.1325 - acc: 0.9515 - val_loss: 0.7857 - val_acc: 0.8114 Epoch 60/100 37/37 
[==============================] - 34s 923ms/step - loss: 0.1312 - acc: 0.9522 - val_loss: 0.7754 - val_acc: 0.7990 Epoch 61/100 37/37 [==============================] - 34s 926ms/step - loss: 0.1302 - acc: 0.9512 - val_loss: 0.7690 - val_acc: 0.8020 Epoch 62/100 37/37 [==============================] - 34s 927ms/step - loss: 0.1280 - acc: 0.9518 - val_loss: 0.8083 - val_acc: 0.8007 Epoch 63/100 37/37 [==============================] - 34s 921ms/step - loss: 0.1269 - acc: 0.9523 - val_loss: 0.8054 - val_acc: 0.7998 Epoch 64/100 37/37 [==============================] - 34s 923ms/step - loss: 0.1235 - acc: 0.9519 - val_loss: 0.7992 - val_acc: 0.8067 Epoch 65/100 37/37 [==============================] - 34s 922ms/step - loss: 0.1127 - acc: 0.9569 - val_loss: 0.8407 - val_acc: 0.8024 Epoch 66/100 37/37 [==============================] - 34s 926ms/step - loss: 0.1067 - acc: 0.9594 - val_loss: 0.8327 - val_acc: 0.8003 Epoch 67/100 37/37 [==============================] - 34s 924ms/step - loss: 0.1111 - acc: 0.9589 - val_loss: 0.8565 - val_acc: 0.8041 Epoch 68/100 37/37 [==============================] - 34s 926ms/step - loss: 0.1113 - acc: 0.9568 - val_loss: 0.8498 - val_acc: 0.8075 Epoch 69/100 37/37 [==============================] - 34s 926ms/step - loss: 0.1029 - acc: 0.9616 - val_loss: 0.8338 - val_acc: 0.8067 Epoch 70/100 37/37 [==============================] - 34s 923ms/step - loss: 0.1061 - acc: 0.9600 - val_loss: 0.8402 - val_acc: 0.8020 Epoch 71/100 37/37 [==============================] - 34s 927ms/step - loss: 0.1019 - acc: 0.9618 - val_loss: 0.8572 - val_acc: 0.8071 Epoch 72/100 37/37 [==============================] - 34s 928ms/step - loss: 0.0975 - acc: 0.9653 - val_loss: 0.8880 - val_acc: 0.8007 Epoch 73/100 37/37 [==============================] - 34s 926ms/step - loss: 0.0962 - acc: 0.9654 - val_loss: 0.8980 - val_acc: 0.8084 Epoch 74/100 37/37 [==============================] - 34s 926ms/step - loss: 0.1017 - acc: 0.9631 - val_loss: 0.8547 - val_acc: 
0.8067 Epoch 75/100 37/37 [==============================] - 34s 926ms/step - loss: 0.0871 - acc: 0.9689 - val_loss: 0.8792 - val_acc: 0.8088 Epoch 76/100 37/37 [==============================] - 34s 928ms/step - loss: 0.0819 - acc: 0.9704 - val_loss: 0.9172 - val_acc: 0.8054 Epoch 77/100 37/37 [==============================] - 35s 933ms/step - loss: 0.0966 - acc: 0.9622 - val_loss: 0.9165 - val_acc: 0.7951 Epoch 78/100 37/37 [==============================] - 34s 928ms/step - loss: 0.0907 - acc: 0.9652 - val_loss: 0.9287 - val_acc: 0.8109 Epoch 79/100 37/37 [==============================] - 34s 927ms/step - loss: 0.0822 - acc: 0.9702 - val_loss: 1.0062 - val_acc: 0.8101 Epoch 80/100 37/37 [==============================] - 34s 926ms/step - loss: 0.0918 - acc: 0.9672 - val_loss: 0.9332 - val_acc: 0.8028 Epoch 81/100 37/37 [==============================] - 34s 923ms/step - loss: 0.0803 - acc: 0.9722 - val_loss: 0.9836 - val_acc: 0.8032 Epoch 82/100 37/37 [==============================] - 34s 928ms/step - loss: 0.0738 - acc: 0.9734 - val_loss: 1.0202 - val_acc: 0.8041 Epoch 83/100 37/37 [==============================] - 34s 926ms/step - loss: 0.0841 - acc: 0.9708 - val_loss: 0.9098 - val_acc: 0.8045 Epoch 84/100 37/37 [==============================] - 34s 927ms/step - loss: 0.0857 - acc: 0.9679 - val_loss: 0.9171 - val_acc: 0.7998 Epoch 85/100 37/37 [==============================] - 34s 930ms/step - loss: 0.0678 - acc: 0.9763 - val_loss: 0.9859 - val_acc: 0.8088 Epoch 86/100 37/37 [==============================] - 34s 932ms/step - loss: 0.0741 - acc: 0.9720 - val_loss: 0.9796 - val_acc: 0.8024 Epoch 87/100 37/37 [==============================] - 34s 930ms/step - loss: 0.0762 - acc: 0.9725 - val_loss: 1.0087 - val_acc: 0.8062 Epoch 88/100 37/37 [==============================] - 34s 932ms/step - loss: 0.0720 - acc: 0.9740 - val_loss: 0.9812 - val_acc: 0.7985 Epoch 89/100 37/37 [==============================] - 35s 934ms/step - loss: 0.0728 - acc: 0.9733 - 
val_loss: 0.9725 - val_acc: 0.7921 Epoch 90/100 37/37 [==============================] - 34s 926ms/step - loss: 0.0787 - acc: 0.9709 - val_loss: 0.9953 - val_acc: 0.8003 Epoch 91/100 37/37 [==============================] - 34s 931ms/step - loss: 0.0661 - acc: 0.9752 - val_loss: 1.0648 - val_acc: 0.7900 Epoch 92/100 37/37 [==============================] - 34s 930ms/step - loss: 0.0715 - acc: 0.9726 - val_loss: 0.9771 - val_acc: 0.7960 Epoch 93/100 37/37 [==============================] - 34s 927ms/step - loss: 0.0735 - acc: 0.9726 - val_loss: 1.0433 - val_acc: 0.7964 Epoch 94/100 37/37 [==============================] - 34s 932ms/step - loss: 0.0636 - acc: 0.9775 - val_loss: 0.9599 - val_acc: 0.7985 Epoch 95/100 37/37 [==============================] - 34s 926ms/step - loss: 0.0610 - acc: 0.9782 - val_loss: 1.0833 - val_acc: 0.7879 Epoch 96/100 37/37 [==============================] - 34s 924ms/step - loss: 0.0709 - acc: 0.9750 - val_loss: 0.9825 - val_acc: 0.7960 Epoch 97/100 37/37 [==============================] - 34s 926ms/step - loss: 0.0609 - acc: 0.9775 - val_loss: 1.0165 - val_acc: 0.7964 Epoch 98/100 37/37 [==============================] - 34s 927ms/step - loss: 0.0573 - acc: 0.9801 - val_loss: 1.0391 - val_acc: 0.7977 Epoch 99/100 37/37 [==============================] - 34s 925ms/step - loss: 0.0557 - acc: 0.9803 - val_loss: 1.0252 - val_acc: 0.7985 Epoch 100/100 37/37 [==============================] - 34s 927ms/step - loss: 0.0601 - acc: 0.9776 - val_loss: 1.0275 - val_acc: 0.8003
loss, accuracy = gru_model_4.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = gru_model_4.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy: {:.4f}".format(accuracy))
Training Accuracy: 0.9570 Testing Accuracy: 0.7770
acc = history_4.history['acc']
val_acc = history_4.history['val_acc']
loss = history_4.history['loss']
val_loss = history_4.history['val_loss']
epochs = range(len(acc))
plt.plot(epochs, acc, 'g', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'g', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
Observation:
Training accuracy climbs steadily, to the point of overfitting, while validation accuracy plateaus at about 0.80.
Training loss decreases steadily, but validation loss dips briefly and then climbs above 1.0, a clear sign of overfitting.
# Get the predicted values.
y_pred = gru_model_4.predict(X_test) # Outputs probabilities of each sentiment.
# Create an empty numpy array to match the number of test observations.
y_pred_array = np.zeros(X_test.shape[0])
# Find the class with highest probability.
for i in range(0, y_pred.shape[0]):
    label_predict = np.argmax(y_pred[i]) # Column with max probability.
    y_pred_array[i] = label_predict
# Convert to integers.
y_pred_array = y_pred_array.astype(int)
np.set_printoptions(precision=2)
# Plot the non-normalized confusion matrix.
plot_confusion_matrix(y_test_array, y_pred_array, classes=class_names,
                      title='Confusion matrix, without normalization')
# Plot the normalized confusion matrix.
plot_confusion_matrix(y_test_array, y_pred_array, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')
plt.show()
Confusion matrix, without normalization [[1604 184 82] [ 197 353 64] [ 64 62 318]] Normalized confusion matrix [[0.86 0.1 0.04] [0.32 0.57 0.1 ] [0.14 0.14 0.72]]
Observations:
As the confusion matrices above show, Model 4 did an excellent job of predicting the negative label when a tweet was actually negative, but was noticeably weaker on the positive and neutral labels. This is likely because the training set was largely composed of negative tweets, so the class imbalance taught the model to assign a higher probability to the negative label. Model 4 outperformed Model 3 on the neutral labels, did slightly worse on the positive labels, and performed about as well as Model 3 on the negative labels.
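One standard remedy for this imbalance, not applied in these runs, is to weight the loss inversely to class frequency. Below is a sketch of the 'balanced' scheme (scikit-learn's `compute_class_weight` implements the same formula), using the commonly reported class counts for this dataset (9,178 negative, 3,099 neutral, 2,363 positive; summing to the 14,640 tweets). The class index order negative/neutral/positive is an assumption; in Keras the resulting dict would be passed as `fit(..., class_weight=class_weight)`.

```python
import numpy as np

# Per-class tweet counts with the dataset's negative-heavy skew
# (assumed index order: 0 = negative, 1 = neutral, 2 = positive).
counts = np.array([9178, 3099, 2363])

# 'Balanced' weights: n_samples / (n_classes * count), so rare classes weigh more.
weights = counts.sum() / (len(counts) * counts)
class_weight = {i: round(w, 2) for i, w in enumerate(weights)}
print(class_weight)  # {0: 0.53, 1: 1.57, 2: 2.07}
```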
Model 5: Reduced GRU with More Regularization.
# GRU Model:
gru_model_5 = Sequential()
gru_model_5.add(embedding_layer)
gru_model_5.add(GRU(64,
                    dropout=0.3,
                    recurrent_dropout=0.5,
                    return_sequences=True))
gru_model_5.add(GRU(32,
                    dropout=0.2,
                    recurrent_dropout=0.5))
gru_model_5.add(Dense(3, activation='softmax'))
gru_model_5.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
gru_model_5.summary()
Model: "sequential_5" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding (Embedding) (None, 30, 100) 1576900 _________________________________________________________________ gru_2 (GRU) (None, 30, 64) 31872 _________________________________________________________________ gru_3 (GRU) (None, 32) 9408 _________________________________________________________________ dense_8 (Dense) (None, 3) 99 ================================================================= Total params: 1,618,279 Trainable params: 41,379 Non-trainable params: 1,576,900 _________________________________________________________________
# Train the GRU model with the chosen hyperparameters.
history_5 = gru_model_5.fit(X_train, y_train,
                            validation_split=0.2,
                            epochs=100, batch_size=256)
Epoch 1/100 37/37 [==============================] - 13s 223ms/step - loss: 0.9068 - acc: 0.6203 - val_loss: 0.8596 - val_acc: 0.6176 Epoch 2/100 37/37 [==============================] - 8s 205ms/step - loss: 0.8451 - acc: 0.6270 - val_loss: 0.8275 - val_acc: 0.6470 Epoch 3/100 37/37 [==============================] - 8s 203ms/step - loss: 0.7920 - acc: 0.6628 - val_loss: 0.7152 - val_acc: 0.7055 Epoch 4/100 37/37 [==============================] - 8s 208ms/step - loss: 0.7090 - acc: 0.6991 - val_loss: 0.6720 - val_acc: 0.7183 Epoch 5/100 37/37 [==============================] - 8s 207ms/step - loss: 0.6728 - acc: 0.7225 - val_loss: 0.6353 - val_acc: 0.7452 Epoch 6/100 37/37 [==============================] - 8s 206ms/step - loss: 0.6506 - acc: 0.7334 - val_loss: 0.6217 - val_acc: 0.7606 Epoch 7/100 37/37 [==============================] - 8s 206ms/step - loss: 0.6220 - acc: 0.7457 - val_loss: 0.5821 - val_acc: 0.7717 Epoch 8/100 37/37 [==============================] - 8s 206ms/step - loss: 0.6057 - acc: 0.7516 - val_loss: 0.5627 - val_acc: 0.7772 Epoch 9/100 37/37 [==============================] - 8s 206ms/step - loss: 0.5855 - acc: 0.7661 - val_loss: 0.5676 - val_acc: 0.7776 Epoch 10/100 37/37 [==============================] - 8s 207ms/step - loss: 0.5761 - acc: 0.7635 - val_loss: 0.5479 - val_acc: 0.7832 Epoch 11/100 37/37 [==============================] - 8s 208ms/step - loss: 0.5754 - acc: 0.7683 - val_loss: 0.5460 - val_acc: 0.7828 Epoch 12/100 37/37 [==============================] - 8s 208ms/step - loss: 0.5603 - acc: 0.7727 - val_loss: 0.5485 - val_acc: 0.7828 Epoch 13/100 37/37 [==============================] - 8s 209ms/step - loss: 0.5591 - acc: 0.7706 - val_loss: 0.5452 - val_acc: 0.7866 Epoch 14/100 37/37 [==============================] - 8s 208ms/step - loss: 0.5490 - acc: 0.7767 - val_loss: 0.5489 - val_acc: 0.7845 Epoch 15/100 37/37 [==============================] - 8s 209ms/step - loss: 0.5386 - acc: 0.7843 - val_loss: 0.5186 - val_acc: 
0.7964
Epochs 16-100 (condensed): training loss fell steadily from 0.5301 (acc 0.7850) to 0.3341 (acc 0.8671), while validation loss hovered between roughly 0.48 and 0.55; val_acc peaked at 0.8207 (epoch 62) and ended at 0.8114 after epoch 100.
loss, accuracy = gru_model_5.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = gru_model_5.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy: {:.4f}".format(accuracy))
Training Accuracy: 0.8963 Testing Accuracy: 0.7982
acc = history_5.history['acc']
val_acc = history_5.history['val_acc']
loss = history_5.history['loss']
val_loss = history_5.history['val_loss']
epochs = range(len(acc))
plt.plot(epochs, acc, 'g', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'g', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
Observation:
Training accuracy climbs steadily to the point of overfitting, while validation accuracy peaks at about 0.80.
Training loss decreases steadily; validation loss dips early and then levels out at about 0.50.
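These curves suggest the model trains well past its best validation epoch. A common remedy is Keras's `EarlyStopping` callback, e.g. `tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)` passed via `callbacks=` to `fit()`. Its decision rule can be sketched in plain Python (a standalone illustration, not part of the original notebook):

```python
import numpy as np

def best_stop_epoch(val_losses, patience=10):
    """Return the epoch index whose weights early stopping would keep,
    given the full validation-loss history (monitor='val_loss')."""
    best, best_epoch, wait = np.inf, 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0  # improvement: reset patience
        else:
            wait += 1
            if wait >= patience:  # no improvement for `patience` epochs: stop
                break
    return best_epoch
```

Applied to the history above, this would halt training shortly after the validation loss stops improving instead of running all 100 epochs.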
# Get the predicted values.
y_pred = gru_model_5.predict(X_test) # Outputs probabilities for each sentiment class.
# Create an empty numpy array matching the number of test observations.
y_pred_array = np.zeros(X_test.shape[0])
# Find the class with the highest probability for each tweet.
# (Equivalently: y_pred_array = np.argmax(y_pred, axis=1))
for i in range(0, y_pred.shape[0]):
    label_predict = np.argmax(y_pred[i]) # Column with the max probability.
    y_pred_array[i] = label_predict
# Convert to integers.
y_pred_array = y_pred_array.astype(int)
np.set_printoptions(precision=2)
# Plot the non-normalized confusion matrix.
plot_confusion_matrix(y_test_array, y_pred_array, classes=class_names,
title='Confusion matrix, without normalization')
# Plot the normalized confusion matrix.
plot_confusion_matrix(y_test_array, y_pred_array, classes=class_names, normalize=True,
title='Normalized confusion matrix')
plt.show()
Confusion matrix, without normalization [[1618 173 79] [ 171 384 59] [ 54 55 335]] Normalized confusion matrix [[0.87 0.09 0.04] [0.28 0.63 0.1 ] [0.12 0.12 0.75]]
Observations:
As the confusion matrices above show, Model 5 did an excellent job predicting the negative label when a tweet was negative, but performed worse on the positive and neutral labels. This is likely because the training set was largely composed of negative tweets, so the class imbalance taught the model to assign higher probability to the negative label.
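One way to counter this imbalance (a sketch, not something the original notebook does) is to weight each class inversely to its frequency, the same scheme as scikit-learn's `compute_class_weight('balanced', ...)`; Keras accepts the resulting dict via `model.fit(..., class_weight=...)`:

```python
import numpy as np

def balanced_class_weights(labels):
    """Inverse-frequency class weights: n_samples / (n_classes * count_c)."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Illustrative labels using the test-set row totals from the confusion
# matrix above (negative: 1870, neutral: 614, positive: 444).
labels = np.repeat([0, 1, 2], [1870, 614, 444])
print(balanced_class_weights(labels))  # minority classes get larger weights
```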
Model 6: Bidirectional RNN.
Recurrent Neural Network (RNN):
Unlike feedforward networks, which process each input individually and independently, an RNN creates loops between the nodes of the network. This makes it particularly well suited to sequential data such as text: it processes a sequence step by step while maintaining a state that summarizes what the network has seen so far. This is why RNNs are useful for natural language processing, where sentences are decoded word by word while keeping a memory of the preceding words to give better context for understanding. An RNN feeds information from a previous output back in as input to the current state; simply put, it uses past information to help make the current decision.
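The recurrence described above can be written in a few lines of NumPy (a toy illustration with made-up dimensions and random weights, not the Keras implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
timesteps, input_dim, state_dim = 5, 8, 4

# Toy weights; in a trained network these are learned.
W_x = rng.normal(size=(input_dim, state_dim))  # input -> state
W_h = rng.normal(size=(state_dim, state_dim))  # previous state -> state
b = np.zeros(state_dim)

x_seq = rng.normal(size=(timesteps, input_dim))  # one "sentence" of 5 tokens
h = np.zeros(state_dim)                          # initial state

# Each step mixes the current input with the state carried from the past.
for x_t in x_seq:
    h = np.tanh(x_t @ W_x + h @ W_h + b)

# The final state h summarizes the whole sequence.
```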
Bidirectional RNNs:
In general, RNNs are order dependent: they process the time steps sequentially in a single direction. A bidirectional RNN, by contrast, processes the sequence in both directions, so it can pick up patterns that a unidirectional model would miss. This tends to improve performance on tasks where the entire sequence is available at once, such as text classification, though it is unsuitable for strictly causal settings where future inputs cannot be seen.
# Import the TensorFlow Bidirectional RNN wrapper.
from tensorflow.keras.layers import Bidirectional
# Bidirectional RNNs.
bdrnn_model_6 = Sequential()
bdrnn_model_6.add(embedding_layer)
bdrnn_model_6.add(Bidirectional(LSTM(64,
dropout=0.2,
recurrent_dropout=0.5)))
bdrnn_model_6.add(Dense(3,activation='softmax'))
bdrnn_model_6.summary()
Model: "sequential_6"

| Layer (type) | Output Shape | Param # |
|---|---|---|
| embedding (Embedding) | (None, 30, 100) | 1,576,900 |
| bidirectional (Bidirectional) | (None, 128) | 84,480 |
| dense_9 (Dense) | (None, 3) | 387 |

Total params: 1,661,767; Trainable params: 84,867; Non-trainable params: 1,576,900
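As a sanity check on the summary, the Bidirectional LSTM's 84,480 parameters can be derived by hand: an LSTM has four gates, each with an input weight matrix (over the 100-dimensional embeddings), a recurrent weight matrix (over the 64 units), and a bias vector, and the bidirectional wrapper doubles the total:

```python
# Parameter count of the Bidirectional(LSTM(64)) layer above.
units, emb_dim = 64, 100
per_direction = 4 * (units * (emb_dim + units) + units)  # 4 gates: W, U, b
total = 2 * per_direction                                # forward + backward
print(total)  # 84480, matching the model summary
```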
bdrnn_model_6.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
# Tune the hyperparameters.
history_6 = bdrnn_model_6.fit(X_train, y_train,
validation_split = 0.2,
epochs=100, batch_size=256)
Epochs 1-100 (condensed): training loss fell from 0.8506 (acc 0.6288) to 0.1570 (acc 0.9416), while validation loss bottomed out near 0.50 around epoch 22 and then climbed steadily to about 0.77, a clear sign of overfitting; val_acc peaked at 0.8190 (epoch 78) and ended at 0.8024 after epoch 100.
loss, accuracy = bdrnn_model_6.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = bdrnn_model_6.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy: {:.4f}".format(accuracy))
Training Accuracy: 0.9472 Testing Accuracy: 0.7814
acc = history_6.history['acc']
val_acc = history_6.history['val_acc']
loss = history_6.history['loss']
val_loss = history_6.history['val_loss']
epochs = range(len(acc))
plt.plot(epochs, acc, 'g', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'g', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
Observation:
Training accuracy climbs steadily to the point of overfitting, while validation accuracy peaks at about 0.80.
Training loss decreases steadily, but validation loss dips early and then climbs to around 0.77. Not ideal.
# Get the predicted values.
y_pred = bdrnn_model_6.predict(X_test) # Outputs probabilities for each sentiment class.
# Create an empty numpy array matching the number of test observations.
y_pred_array = np.zeros(X_test.shape[0])
# Find the class with the highest probability for each tweet.
# (Equivalently: y_pred_array = np.argmax(y_pred, axis=1))
for i in range(0, y_pred.shape[0]):
    label_predict = np.argmax(y_pred[i]) # Column with the max probability.
    y_pred_array[i] = label_predict
# Convert to integers.
y_pred_array = y_pred_array.astype(int)
np.set_printoptions(precision=2)
# Plot the non-normalized confusion matrix.
plot_confusion_matrix(y_test_array, y_pred_array, classes=class_names,
title='Confusion matrix, without normalization')
# Plot the normalized confusion matrix.
plot_confusion_matrix(y_test_array, y_pred_array, classes=class_names, normalize=True,
title='Normalized confusion matrix')
plt.show()
Confusion matrix, without normalization [[1605 201 64] [ 185 384 45] [ 78 67 299]] Normalized confusion matrix [[0.86 0.11 0.03] [0.3 0.63 0.07] [0.18 0.15 0.67]]
Observations:
As the confusion matrices above show, Model 6 did an excellent job predicting the negative label when a tweet was negative, but performed worse on the positive and neutral labels. This is likely because the training set was largely composed of negative tweets, so the class imbalance taught the model to assign higher probability to the negative label.
https://theweek.com/10things/536518/10-things-need-know-today-february22-2015
https://nlp.stanford.edu/projects/glove/
https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
https://realpython.com/python-keras-text-classification/
https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
df_Tweets_Stop = df_Tweets_Orig.copy()
df_Tweets_Stop = df_Tweets_Stop[["text","airline_sentiment"]]
# Remove the HTML tags.
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()
# Expand the contractions.
def replace_contractions(text):
    """Replace contractions in a string of text."""
    return contractions.fix(text)
# Remove the numerals present in the text.
def remove_numbers(text):
    text = re.sub(r'\d+', '', text)
    return text
# Remove the URLs present in the text.
def remove_url(text):
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)
    return text
# Remove the @mentions in the tweets.
def remove_mention(text):
    text = re.sub(r'@\w+', '', text)
    return text
# Apply all the cleaning steps in sequence.
def clean_text(text):
    text = strip_html(text)
    text = replace_contractions(text)
    text = remove_numbers(text)
    text = remove_url(text)
    text = remove_mention(text)
    return text
df_Tweets_Stop['text'] = df_Tweets_Stop['text'].apply(lambda x: clean_text(x))
df_Tweets_Stop.head(15)
| | text | airline_sentiment |
|---|---|---|
| 0 | What said. | neutral |
| 1 | plus you have added commercials to the experience... tacky. | positive |
| 2 | I did not today... Must mean I need to take another trip! | neutral |
| 3 | it is really aggressive to blast obnoxious "entertainment" in your guests' faces & they have little recourse | negative |
| 4 | and it is a really big bad thing about it | negative |
| 5 | seriously would pay $ a flight for seats that did not have this playing.\nit is really the only bad thing about flying VA | negative |
| 6 | yes, nearly every time I fly VX this “ear worm” will not go away :) | positive |
| 7 | Really missed a prime opportunity for Men Without Hats parody, there. | neutral |
| 8 | Well, I did not…but NOW I DO! :-D | positive |
| 9 | it was amazing, and arrived an hour early. you are too good to me. | positive |
| 10 | did you know that suicide is the second leading because of death among teens - | neutral |
| 11 | I < pretty graphics. so much better than minimal iconography. :D | positive |
| 12 | This is such a great deal! Already thinking about my nd trip to & I have not even gone on my st trip yet! ;p | positive |
| 13 | I am flying your #fabulous #Seductive skies again! you take all the #stress away from travel | positive |
| 14 | Thanks! | positive |
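As a quick sanity check, the regex-based steps in `clean_text` (URL, mention, and number removal) can be exercised on a hand-made sample tweet. The tweet below and the simplified URL pattern are illustrative assumptions, not taken from the dataset or the notebook's exact regex:

```python
import re

# Hypothetical sample tweet (not from the dataset) exercising the three regex steps.
sample = "@VirginAmerica flight 729 delayed again, see https://example.com/status"

no_url = re.sub(r'https?://\S+', '', sample)   # simplified URL pattern for illustration
no_mention = re.sub(r'@\w+', '', no_url)       # strip @mentions, as in remove_mention
no_numbers = re.sub(r'\d+', '', no_mention)    # strip digits, as in remove_numbers

print(no_numbers.strip())
```

After all three substitutions, the URL, the @mention, and the flight number are gone while the sentiment-bearing words survive.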
df_Tweets_Stop['text'] = df_Tweets_Stop.apply(lambda row: nltk.word_tokenize(row['text']), axis=1) # Tokenization of data
df_Tweets_Stop.head(15)
| | text | airline_sentiment |
|---|---|---|
| 0 | [What, said, .] | neutral |
| 1 | [plus, you, have, added, commercials, to, the, experience, ..., tacky, .] | positive |
| 2 | [I, did, not, today, ..., Must, mean, I, need, to, take, another, trip, !] | neutral |
| 3 | [it, is, really, aggressive, to, blast, obnoxious, ``, entertainment, '', in, your, guests, ', faces, &, they, have, little, recourse] | negative |
| 4 | [and, it, is, a, really, big, bad, thing, about, it] | negative |
| 5 | [seriously, would, pay, $, a, flight, for, seats, that, did, not, have, this, playing, ., it, is, really, the, only, bad, thing, about, flying, VA] | negative |
| 6 | [yes, ,, nearly, every, time, I, fly, VX, this, “, ear, worm, ”, will, not, go, away, :, )] | positive |
| 7 | [Really, missed, a, prime, opportunity, for, Men, Without, Hats, parody, ,, there, .] | neutral |
| 8 | [Well, ,, I, did, not…but, NOW, I, DO, !, :, -D] | positive |
| 9 | [it, was, amazing, ,, and, arrived, an, hour, early, ., you, are, too, good, to, me, .] | positive |
| 10 | [did, you, know, that, suicide, is, the, second, leading, because, of, death, among, teens, -] | neutral |
| 11 | [I, <, pretty, graphics, ., so, much, better, than, minimal, iconography, ., :, D] | positive |
| 12 | [This, is, such, a, great, deal, !, Already, thinking, about, my, nd, trip, to, &, I, have, not, even, gone, on, my, st, trip, yet, !, ;, p] | positive |
| 13 | [I, am, flying, your, #, fabulous, #, Seductive, skies, again, !, you, take, all, the, #, stress, away, from, travel] | positive |
| 14 | [Thanks, !] | positive |
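The `nltk.word_tokenize` call above splits each tweet into word and punctuation tokens, as the table shows. A rough stand-in for it, shown only to illustrate the idea (NLTK's tokenizer handles many more edge cases), can be built with a single regex:

```python
import re

# Simplified tokenizer: runs of word characters, or single punctuation marks.
def simple_tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("it was amazing, and arrived an hour early."))
```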
# The NLTK stopword list and lemmatizer are assumed from the earlier setup;
# they are re-created here so this cell runs on its own.
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import WordNetLemmatizer
stopwords = nltk_stopwords.words('english')
# This adds the word "flight" to the list of stopwords and effectively removes it from our
# sentiment dataset for the purposes of model building.
stopwords.append('flight')
lemmatizer = WordNetLemmatizer()
#remove the non-ASCII characters
def remove_non_ascii(words):
"""Remove non-ASCII characters from list of tokenized words"""
new_words = []
for word in words:
new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
new_words.append(new_word)
return new_words
# convert all characters to lowercase
def to_lowercase(words):
"""Convert all characters to lowercase from list of tokenized words"""
new_words = []
for word in words:
new_word = word.lower()
new_words.append(new_word)
return new_words
# Remove the hashtags
def remove_hash(text):
"""Remove hashtags from list of tokenized words"""
new_words = []
for word in words:
new_word = re.sub(r'#\w+','',word)
if new_word != '':
new_words.append(new_word)
return new_words
# Remove the punctuations
def remove_punctuation(words):
"""Remove punctuation from list of tokenized words"""
new_words = []
for word in words:
new_word = re.sub(r'[^\w\s]', '', word)
if new_word != '':
new_words.append(new_word)
return new_words
# Remove the stop words
def remove_stopwords(words):
"""Remove stop words from list of tokenized words"""
new_words = []
for word in words:
if word not in stopwords:
new_words.append(word)
return new_words
# lemmatize the words
def lemmatize_list(words):
new_words = []
for word in words:
new_words.append(lemmatizer.lemmatize(word, pos='v'))
return new_words
def normalize(words):
words = remove_non_ascii(words)
words = to_lowercase(words)
words = remove_punctuation(words)
words = remove_stopwords(words)
words = lemmatize_list(words)
return ' '.join(words)
df_Tweets_Stop['text'] = df_Tweets_Stop.apply(lambda row: normalize(row['text']), axis=1)
df_Tweets_Stop.head(15)
| | text | airline_sentiment |
|---|---|---|
| 0 | say | neutral |
| 1 | plus add commercials experience tacky | positive |
| 2 | today must mean need take another trip | neutral |
| 3 | really aggressive blast obnoxious entertainment guests face little recourse | negative |
| 4 | really big bad thing | negative |
| 5 | seriously would pay seat play really bad thing fly va | negative |
| 6 | yes nearly every time fly vx ear worm go away | positive |
| 7 | really miss prime opportunity men without hat parody | neutral |
| 8 | well notbut | positive |
| 9 | amaze arrive hour early good | positive |
| 10 | know suicide second lead death among teens | neutral |
| 11 | pretty graphics much better minimal iconography | positive |
| 12 | great deal already think nd trip even go st trip yet p | positive |
| 13 | fly fabulous seductive sky take stress away travel | positive |
| 14 | thank | positive |
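The `normalize` pipeline above chains the per-token helpers. A minimal self-contained sketch of the same idea is shown below, using a tiny illustrative stopword set (the notebook uses NLTK's full English list plus "flight") and omitting lemmatization, which requires the WordNet corpus:

```python
import re
import unicodedata

# Toy stopword set for illustration only; the real pipeline uses NLTK's list + "flight".
stopwords = {'it', 'is', 'a', 'the', 'about', 'flight'}

def normalize_sketch(tokens):
    out = []
    for tok in tokens:
        # drop non-ASCII characters (e.g. curly quotes), as in remove_non_ascii
        tok = unicodedata.normalize('NFKD', tok).encode('ascii', 'ignore').decode('ascii')
        tok = tok.lower()                   # as in to_lowercase
        tok = re.sub(r'[^\w\s]', '', tok)   # as in remove_punctuation
        if tok and tok not in stopwords:    # as in remove_stopwords
            out.append(tok)
    return ' '.join(out)

print(normalize_sketch(['It', 'is', 'a', 'really', 'bad', 'flight', '!', '\u201cgreat\u201d']))
```

The stopwords, punctuation, and curly quotes are all stripped, leaving only the content-bearing tokens.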
# importing all necessary modules
from wordcloud import WordCloud
from wordcloud import STOPWORDS
import matplotlib.pyplot as plt
stopword_list = set(STOPWORDS)
word_lists = df_Tweets_Stop['text']
unique_str = ' '.join(word_lists)
#generate_wordcloud(unique_str)
word_cloud = WordCloud(width = 3000, height = 2500,
background_color ='blue',
stopwords = stopword_list,
min_font_size = 10).generate(unique_str)
# Visualize the WordCloud Plot
# Set wordcloud figure size
plt.figure(1,figsize=(12, 12))
# Show image
plt.imshow(word_cloud)
# Remove Axis
plt.axis("off")
# show plot
plt.show()
Observation:
The word "flight" is no longer present in the dataset.
Feature Generation using CountVectorizer.
# Import CountVectorizer and RegexTokenizer
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import CountVectorizer
# Create Regex tokenizer for removing special symbols and numeric values
regex_tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
# Initialize CountVectorizer object
count_vectorizer = CountVectorizer(lowercase=True,
stop_words='english',
ngram_range = (1,1),
tokenizer = regex_tokenizer.tokenize)
# Fit and transform the dataset
count_vectors = count_vectorizer.fit_transform(df_Tweets_Stop['text'])
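Conceptually, `CountVectorizer` builds a vocabulary over the corpus and represents each document as a vector of token counts. A pure-Python sketch of that bag-of-words idea, on two made-up documents:

```python
from collections import Counter

# Two illustrative documents (not from the dataset).
docs = ["bad flight delay", "great crew great service"]

# Build a sorted vocabulary over all documents, then one count vector per document.
vocab = sorted({tok for doc in docs for tok in doc.split()})
vectors = [[Counter(doc.split())[tok] for tok in vocab] for doc in docs]

print(vocab)    # ['bad', 'crew', 'delay', 'flight', 'great', 'service']
print(vectors)  # [[1, 0, 1, 1, 0, 0], [0, 1, 0, 0, 2, 1]]
```

`CountVectorizer` does the same thing at scale, returning a sparse matrix instead of Python lists.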
Split train and test.
# Import train_test_split
from sklearn.model_selection import train_test_split
# Partition data into training and testing set
feature_train, feature_test, target_train, target_test = train_test_split(
count_vectors, df_Tweets_Stop['airline_sentiment'], test_size=0.3, random_state=1)
Classification Model Building using Logistic Regression.
# import logistic regression scikit-learn model
from sklearn.linear_model import LogisticRegression
# instantiate the model
logreg = LogisticRegression(solver='lbfgs')
# fit the model with data
logreg.fit(feature_train,target_train)
# Forecast the target variable for given test dataset
predictions = logreg.predict(feature_test)
Evaluate the Classification Model.
# Import metrics module for performance evaluation
from sklearn.metrics import accuracy_score
# Assess model performance using accuracy measure
print("Logistic Regression Model Accuracy:",accuracy_score(target_test, predictions))
Logistic Regression Model Accuracy: 0.7748178506375227
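Accuracy is simply the fraction of predictions that match the true labels. A by-hand sketch on a few made-up labels makes the metric concrete:

```python
# Illustrative labels only; accuracy = correct predictions / total predictions.
y_true = ['negative', 'neutral', 'negative', 'positive', 'negative']
y_pred = ['negative', 'negative', 'negative', 'positive', 'neutral']

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)  # 3 of 5 correct -> 0.6
```

Note that on an imbalanced dataset like this one, accuracy can look respectable even when minority classes (positive, neutral) are predicted poorly, which is why the earlier confusion matrices are worth inspecting too.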
Classification using TF-IDF.
# Import TfidfVectorizer and RegexTokenizer
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Create Regex tokenizer for removing special symbols and numeric values
regex_tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
# Initialize TfidfVectorizer object
tfidf = TfidfVectorizer(lowercase=True,stop_words='english',ngram_range = (1,1),tokenizer = regex_tokenizer.tokenize)
# Fit and transform the dataset
text_tfidf= tfidf.fit_transform(df_Tweets_Stop['text'])
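Unlike raw counts, TF-IDF down-weights terms that appear in many documents. A sketch of the classic weighting, tf(t, d) × log(N / df(t)), on made-up documents (scikit-learn uses a smoothed idf plus L2 normalization, so its values differ from these):

```python
import math

# Three illustrative tokenized documents (not from the dataset).
docs = [["bad", "flight"], ["great", "flight"], ["great", "crew"]]
N = len(docs)

def idf(term):
    df = sum(term in doc for doc in docs)   # document frequency
    return math.log(N / df)                 # classic (unsmoothed) idf

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)         # term frequency
    return tf * idf(term)

# "flight" appears in 2 of 3 documents, so it is down-weighted relative to "bad".
print(round(tfidf("flight", docs[0]), 3))   # 0.203
print(round(tfidf("bad", docs[0]), 3))      # 0.549
```

This is why common, low-information words contribute less to the TF-IDF features than rare, discriminative ones.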
# Import train_test_split
from sklearn.model_selection import train_test_split
# Partition data into training and testing set
feature_train, feature_test, target_train, target_test = train_test_split(
text_tfidf, df_Tweets_Stop['airline_sentiment'], test_size=0.3, random_state=1)
# import logistic regression scikit-learn model
from sklearn.linear_model import LogisticRegression
# instantiate the model
logreg = LogisticRegression(solver='lbfgs')
# fit the model with data
logreg.fit(feature_train,target_train)
# Forecast the target variable for given test dataset
predictions = logreg.predict(feature_test)
# Import metrics module for performance evaluation
from sklearn.metrics import accuracy_score
# Assess model performance using accuracy measure
print("Logistic Regression Model Accuracy:",accuracy_score(target_test, predictions))
Logistic Regression Model Accuracy: 0.7693533697632058
Observation:
Removing the word "flight" did not improve accuracy with either the CountVectorizer or the TF-IDF features; the logistic regression model remains at approximately 77% in both cases.